Hotel Booking Demand¶

Course Code: AIT201
Course Name: Applied Machine Learning
Lecturer: Dr Raheel Zafar
Academic Session: 2024/04
Submission Due Date: 1/07/2024

Prepared by:
DSC2209674 - CHEONG YI JUN
DSC2209676 - LEE SHAN YAN
DSC2209677 - LEE ZI HOOI
DSC2209680 - PANG YI
DSC2209995 - FOO YUN HAN

Own Work Declaration¶

We hereby understand my/our work would be checked for plagiarism or other misconduct, and the softcopy would be saved for future comparison(s).

We hereby confirm that all the references or sources of citations have been correctly listed or presented and I/we clearly understand the serious consequence caused by any intentional or unintentional misconduct.

This work is not made on any work of other students (past or present), and it has not been submitted to any other courses or institutions before.

Date: 30/6/2024

Signature:

Signatures

1.0 Introduction¶

1.1 Overview of the Project¶

Nowadays, the hotel industry is in a dynamic and competitive environment due to the travel industry continuing to grow. Therefore, managing booking cancellations is a key aspect of improving operational efficiency and optimising the revenue of the hotels. This is because, hotel booking cancellations can have a significant impact on hotel revenues, occupancy and resource allocation. However, cancellations are often due to a variety of reasons, including changes in travel plans, price sensitivity or booking errors. Thus, understanding and predicting these cancellations can bring significant benefits to hotel management and enable better customer service.

This project is focused on predicting hotel booking cancellations using machine learning techniques. We are analysing historical booking data and then using it to identify the patterns and factors that lead to cancellations.

1.2 Objectives and Goals¶

The primary objective of this project is to build and evaluate predictive models that can accurately forecast hotel booking cancellations. Thus, our goals include:

  1. To identify the patterns and factors contributing to cancellations.

  2. To create and validate machine learning models that predict hotel booking cancellations.

  3. To provide actionable insights that help hotels reduce cancellation rates and improve resource allocation.

1.3 Description of the Dataset¶

To achieve the objectives of predicting hotel booking cancellations and identifying the patterns and factors, we utilised a comprehensive dataset that provides detailed information about bookings in two hotels: Resort Hotel and City Hotel. This dataset serves as the foundation for our analysis and model development.

Key Features and Structure of the Dataset:

Attribute Description Data Type
hotel Hotel Type (Resort Hotel and City Hotel) Object
is_canceled Value indicating if the booking was cancelled (1 return cancelled, 0 is not cancelled) Integer
lead_time Number of days that elapsed between the entering date of the booking into the PMS and the arrival date Integer
arrival_date_year Year of arrival date Integer
arrival_date_month Month of arrival date with 12 categories (January to December) Object
arrival_date_week_number Week number of the arrival date Integer
arrival_date_day_of_month Day of the month of the arrival date Integer
stays_in_weekend_nights Number of weekend nights (Saturday or Sunday) the guest stayed or booked to stay at the hotel Integer
stays_in_week_nights Number of weeknights (Monday to Friday) the guest stayed or booked to stay at the hotel Integer
adults Number of adults Integer
children Number of children Integer
babies Number of babies Integer
meal Type of meal booked. Categories are presented in standard hospitality meal packages: Object
Undefined/SC: no meal package
BB: Bed and Breakfast
HB: Half board (breakfast and one other meal – usually dinner)
FB: Full board (breakfast, lunch and dinner)
country Country of origin. Categories are represented in the ISO 3155–3:2013 format Object
market_segment Market segment designation (TA means Travel Agents and TO means Tour Operators) Object
distribution_channel Booking distribution channel (TA means Travel Agents and TO means Tour Operators) Object
is_repeated_guest Value indicating if the booking name was from a repeated guest (1 return repeated, 0 is not repeated) Integer
previous_cancellations Number of previous bookings that were cancelled by the customer before the current booking Integer
previous_bookings_not_canceled Number of previous bookings not cancelled by the customer before the current booking Integer
reserved_room_type Code of room type reserved Object
assigned_room_type Code for the type of room assigned to the booking Object
booking_changes Number of changes/amendments made to the booking from the moment the booking was entered on the PMS until the moment of check-in or cancellation Integer
deposit_type Indication on if the customer deposited to guarantee the booking. This variable can assume three categories: Object
No Deposit: No deposit was made
Non Refund: A deposit was made in the value of the total stay cost
Refundable: A deposit was made with a value under the total cost of stay
agent ID of the travel agency that made the booking Float
company ID of the company/entity that made the booking or is responsible for paying the booking Float
days_in_waiting_list Number of days the booking was in the waiting list before it was confirmed to the customer Integer
customer_type Type of booking, assuming one of four categories: Object
Contract: When the booking has an allotment or other type of contract associated to it
Group: When the booking is associated with a group
Transient: When the booking is not part of a group or contract and is not associated with other transient booking
Transient-party: When the booking is transient, but is associated with at least other transient bookings
adr Average Daily Rate Float
required_car_parking_spaces Number of car parking spaces required by the customer Integer
total_of_special_requests Number of special requests made by the customer (e.g. twin bed or high floor) Integer
reservation_status Reservation last status, assuming one of three categories (Cancelled, Check-Out and No-show) Object
reservation_status_date The date at which the last status was set DateTime

The dataset contains approximately 119,390 observations and 32 features. It provides a robust base for both exploratory analysis and predictive modelling. It is structured in a tabular format with each row representing a unique booking and each column representing an attribute of that booking.

2.0 Exploratory Data Analysis¶

2.1 Data Preprocessing¶

The data is loaded from the CSV file and the first few rows are printed to give an initial view of the dataset

In [ ]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from tabulate import tabulate
import calendar
import numpy as np
import geopandas as gpd
import pycountry
from IPython.display import display, Markdown
In [ ]:
# Load the dataset
file_path = r"hotel_bookings.csv"
df = pd.read_csv(file_path)

# Display the first few rows of the dataframe
print("First few rows of the dataset:")
df.head()
First few rows of the dataset:
Out[ ]:
hotel is_canceled lead_time arrival_date_year arrival_date_month arrival_date_week_number arrival_date_day_of_month stays_in_weekend_nights stays_in_week_nights adults ... deposit_type agent company days_in_waiting_list customer_type adr required_car_parking_spaces total_of_special_requests reservation_status reservation_status_date
0 Resort Hotel 0 342 2015 July 27 1 0 0 2 ... No Deposit NaN NaN 0 Transient 0.0 0 0 Check-Out 2015-07-01
1 Resort Hotel 0 737 2015 July 27 1 0 0 2 ... No Deposit NaN NaN 0 Transient 0.0 0 0 Check-Out 2015-07-01
2 Resort Hotel 0 7 2015 July 27 1 0 1 1 ... No Deposit NaN NaN 0 Transient 75.0 0 0 Check-Out 2015-07-02
3 Resort Hotel 0 13 2015 July 27 1 0 1 1 ... No Deposit 304.0 NaN 0 Transient 75.0 0 0 Check-Out 2015-07-02
4 Resort Hotel 0 14 2015 July 27 1 0 2 2 ... No Deposit 240.0 NaN 0 Transient 98.0 0 1 Check-Out 2015-07-03

5 rows × 32 columns

Next, we get an overview of the data, showing the data types

In [ ]:
# Get an overview of the dataframe
print("\nDataframe info:")
df.info()
Dataframe info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 119390 entries, 0 to 119389
Data columns (total 32 columns):
 #   Column                          Non-Null Count   Dtype  
---  ------                          --------------   -----  
 0   hotel                           119390 non-null  object 
 1   is_canceled                     119390 non-null  int64  
 2   lead_time                       119390 non-null  int64  
 3   arrival_date_year               119390 non-null  int64  
 4   arrival_date_month              119390 non-null  object 
 5   arrival_date_week_number        119390 non-null  int64  
 6   arrival_date_day_of_month       119390 non-null  int64  
 7   stays_in_weekend_nights         119390 non-null  int64  
 8   stays_in_week_nights            119390 non-null  int64  
 9   adults                          119390 non-null  int64  
 10  children                        119386 non-null  float64
 11  babies                          119390 non-null  int64  
 12  meal                            119390 non-null  object 
 13  country                         118902 non-null  object 
 14  market_segment                  119390 non-null  object 
 15  distribution_channel            119390 non-null  object 
 16  is_repeated_guest               119390 non-null  int64  
 17  previous_cancellations          119390 non-null  int64  
 18  previous_bookings_not_canceled  119390 non-null  int64  
 19  reserved_room_type              119390 non-null  object 
 20  assigned_room_type              119390 non-null  object 
 21  booking_changes                 119390 non-null  int64  
 22  deposit_type                    119390 non-null  object 
 23  agent                           103050 non-null  float64
 24  company                         6797 non-null    float64
 25  days_in_waiting_list            119390 non-null  int64  
 26  customer_type                   119390 non-null  object 
 27  adr                             119390 non-null  float64
 28  required_car_parking_spaces     119390 non-null  int64  
 29  total_of_special_requests       119390 non-null  int64  
 30  reservation_status              119390 non-null  object 
 31  reservation_status_date         119390 non-null  object 
dtypes: float64(4), int64(16), object(12)
memory usage: 29.1+ MB

2.1.1 Checking Missing Values¶

We check for missing values in each column of our dataset

Additionally, we visualise the missing data using a heatmap

In [ ]:
# Check for missing values
print("\nMissing values in each column:")
print(df.isnull().sum())

# Optionally, visualize missing data
plt.figure(figsize=(12, 8))
sns.heatmap(df.isnull(), cbar=False, cmap='viridis')
plt.title("Heatmap of Missing Values")
plt.show()
Missing values in each column:
hotel                                  0
is_canceled                            0
lead_time                              0
arrival_date_year                      0
arrival_date_month                     0
arrival_date_week_number               0
arrival_date_day_of_month              0
stays_in_weekend_nights                0
stays_in_week_nights                   0
adults                                 0
children                               4
babies                                 0
meal                                   0
country                              488
market_segment                         0
distribution_channel                   0
is_repeated_guest                      0
previous_cancellations                 0
previous_bookings_not_canceled         0
reserved_room_type                     0
assigned_room_type                     0
booking_changes                        0
deposit_type                           0
agent                              16340
company                           112593
days_in_waiting_list                   0
customer_type                          0
adr                                    0
required_car_parking_spaces            0
total_of_special_requests              0
reservation_status                     0
reservation_status_date                0
dtype: int64
No description has been provided for this image

2.1.2 Handling Missing Values¶

  1. we drop both agent and company columns as they are a lot of missing data in the column, and both these columns are considered not important in our coming analysis
  2. we fill countries (categorical column) with mode and children (numerical column) with mean, as we need the data for the columns
In [ ]:
# Handle missing values

# Fill missing values in the 'country' column with the mode
df['country'] = df['country'].fillna(df['country'].mode()[0])
# Fill missing values in 'children' column with mean before converting to int
df['children'] = df['children'].fillna(df['children'].mean())

# Drop the 'agent' and 'company' columns as they are not needed
df = df.drop(columns=['agent', 'company'])

# Verify missing values are handled
print("\nMissing values after handling:")
print(df.isnull().sum())
Missing values after handling:
hotel                             0
is_canceled                       0
lead_time                         0
arrival_date_year                 0
arrival_date_month                0
arrival_date_week_number          0
arrival_date_day_of_month         0
stays_in_weekend_nights           0
stays_in_week_nights              0
adults                            0
children                          0
babies                            0
meal                              0
country                           0
market_segment                    0
distribution_channel              0
is_repeated_guest                 0
previous_cancellations            0
previous_bookings_not_canceled    0
reserved_room_type                0
assigned_room_type                0
booking_changes                   0
deposit_type                      0
days_in_waiting_list              0
customer_type                     0
adr                               0
required_car_parking_spaces       0
total_of_special_requests         0
reservation_status                0
reservation_status_date           0
dtype: int64

Missing values are visualized and then filled. Numerical columns are filled with the mean, and categorical columns are filled with the mode. This ensures that no data is missing

2.1.3 Converting Data Types¶

Data transformation - converting to more appropriate data types

In [ ]:
# Convert data types if necessary (example: date columns)
# Suppose there's a column 'reservation_status_date' that should be datetime
df['reservation_status_date'] = pd.to_datetime(df['reservation_status_date'])

# Convert 'children' to int64
df['children'] = df['children'].astype('int64')

# Verify the changes
print("\nData types after conversion:\n")
print(df.info())
Data types after conversion:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 119390 entries, 0 to 119389
Data columns (total 30 columns):
 #   Column                          Non-Null Count   Dtype         
---  ------                          --------------   -----         
 0   hotel                           119390 non-null  object        
 1   is_canceled                     119390 non-null  int64         
 2   lead_time                       119390 non-null  int64         
 3   arrival_date_year               119390 non-null  int64         
 4   arrival_date_month              119390 non-null  object        
 5   arrival_date_week_number        119390 non-null  int64         
 6   arrival_date_day_of_month       119390 non-null  int64         
 7   stays_in_weekend_nights         119390 non-null  int64         
 8   stays_in_week_nights            119390 non-null  int64         
 9   adults                          119390 non-null  int64         
 10  children                        119390 non-null  int64         
 11  babies                          119390 non-null  int64         
 12  meal                            119390 non-null  object        
 13  country                         119390 non-null  object        
 14  market_segment                  119390 non-null  object        
 15  distribution_channel            119390 non-null  object        
 16  is_repeated_guest               119390 non-null  int64         
 17  previous_cancellations          119390 non-null  int64         
 18  previous_bookings_not_canceled  119390 non-null  int64         
 19  reserved_room_type              119390 non-null  object        
 20  assigned_room_type              119390 non-null  object        
 21  booking_changes                 119390 non-null  int64         
 22  deposit_type                    119390 non-null  object        
 23  days_in_waiting_list            119390 non-null  int64         
 24  customer_type                   119390 non-null  object        
 25  adr                             119390 non-null  float64       
 26  required_car_parking_spaces     119390 non-null  int64         
 27  total_of_special_requests       119390 non-null  int64         
 28  reservation_status              119390 non-null  object        
 29  reservation_status_date         119390 non-null  datetime64[ns]
dtypes: datetime64[ns](1), float64(1), int64(17), object(11)
memory usage: 27.3+ MB
None

From the output above, we can observe that the data types for specific columns have been successfully modified

2.2 Data Visualisation and Insights¶

2.2.1 How Many Bookings Were Cancelled?¶

In [ ]:
# Calculate the number of cancellations
num_cancelled = df['is_canceled'].sum()
total_bookings = df.shape[0]

# Create a figure with 1 row and 2 columns
fig, axes = plt.subplots(1, 2, figsize=(18, 9))

# First plot: Total Bookings
ax = sns.countplot(x='is_canceled', data=df, palette='Blues', ax=axes[0])
axes[0].set_title('Total Bookings (Overall)')
axes[0].set_xlabel('Booking Status')
axes[0].set_ylabel('Number of Bookings')
axes[0].set_xticks([0, 1])
axes[0].set_xticklabels(['Not Cancelled', 'Cancelled'])

for p in ax.patches:
    ax.annotate(f'{int(p.get_height())}', (p.get_x() + p.get_width() / 2., p.get_height()),
                ha='center', va='baseline', fontsize=12, color='black', xytext=(0, 5),
                textcoords='offset points')

# Second plot: Reservation status in different hotels
ax1 = sns.countplot(x='hotel', hue='is_canceled', data=df, palette='Blues', ax=axes[1])
axes[1].set_title('Total Bookings in Different Hotels', size=12, color='Black')
axes[1].set_xlabel('Hotel', color='Black')
axes[1].set_ylabel('Number of Booking', color='Black')

# Customize legend location
legend_labels, _ = ax1.get_legend_handles_labels()
axes[1].legend(bbox_to_anchor=(1, 1))

# Customize legend labels
axes[1].legend(['Not Cancelled', 'Cancelled'])

# Annotate each bar with the count value
for p in ax1.patches:
    ax1.annotate(f'{int(p.get_height())}', (p.get_x() + p.get_width() / 2., p.get_height()),
                 ha='center', va='baseline', fontsize=10, color='black', xytext=(0, 5),
                 textcoords='offset points')

# Adjust layout to prevent overlap
plt.tight_layout()

# Show the plot
plt.show()

# Print the results
print(f"Total number of bookings: {total_bookings}")
print(f"Total number of bookings cancelled: {num_cancelled}")
C:\Users\Acer\AppData\Local\Temp\ipykernel_14172\269524683.py:9: FutureWarning: 

Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `x` variable to `hue` and set `legend=False` for the same effect.

  ax = sns.countplot(x='is_canceled', data=df, palette='Blues', ax=axes[0])
No description has been provided for this image
Total number of bookings: 119390
Total number of bookings cancelled: 44224

Analysis:

The dataset contains a total of 119,390 bookings, while 44,224 of them were cancelled. The bar chart provides a clear visual representation, showing that a significant portion of the bookings (approximately 37%) were cancelled, while the majority (about 63%) were not cancelled. The sub plot displays the number of reservations for Resort Hotel and City Hotel, with further segmentation into cancelled and not cancelled bookings. The number of cancelled bookings for Resort Hotel is 11,122, accounting for 25% of the total cancelled bookings, while for City Hotel, there are 33,102 cancellations, making up 75% of the total cancelled bookings.

2.2.2 What is the booking ratio between Resort Hotel and City Hotel?¶

In [ ]:
# Calculate the total number of bookings for each hotel type
resort_bookings = df[df['hotel'] == 'Resort Hotel'].shape[0]
city_bookings = df[df['hotel'] == 'City Hotel'].shape[0]

# Calculate the booking ratio
booking_ratio = resort_bookings / city_bookings

# Visualization of booking distribution using a pie chart
labels = ['Resort Hotel', 'City Hotel']
sizes = [resort_bookings, city_bookings]
colors = ['#ff9999','#66b3ff']
explode = (0.1, 0)  # explode the 1st slice (Resort Hotel)

plt.figure(figsize=(8, 8))
plt.pie(sizes, explode=explode, labels=labels, colors = sns.color_palette('Blues'), autopct='%1.1f%%',
        shadow=True, startangle=140)
plt.title('Booking Ratio between Resort Hotel and City Hotel')
plt.show()

# Print the results
print(f"Total number of Resort Hotel bookings: {resort_bookings}")
print(f"Total number of City Hotel bookings: {city_bookings}")
print(f"Booking ratio (Resort Hotel : City Hotel) = 1:2")
No description has been provided for this image
Total number of Resort Hotel bookings: 40060
Total number of City Hotel bookings: 79330
Booking ratio (Resort Hotel : City Hotel) = 1:2

Analysis:

Resort Hotel has a total of 40,060 bookings while City Hotel has a total of 79,330 bookings. The booking ratio between Resort Hotel and City Hotel is 1:2. This means that for every booking at the Resort Hotel, there are approximately two bookings at the City Hotel. A pie chart visually represents this ratio, showing that City Hotel bookings constitute a larger portion of the total bookings (about 66.4%), while Resort Hotel bookings make up the remaining 33.6%.

2.2.3 What is the percentage of booking for each year?¶

In [ ]:
import matplotlib.pyplot as plt

# Assuming df is your DataFrame with the necessary data
# List of unique years in the dataset
years = sorted(df['arrival_date_year'].unique())

# Define colors for the pie charts
colors = ['lightcoral', 'skyblue']

# Create subplots: one for each year
fig, axes = plt.subplots(1, len(years), figsize=(18, 6))

# Initialize a dictionary to store percentages
booking_percentages = {}

for i, year in enumerate(years):
    # Filter data for the given year
    df_year = df[df['arrival_date_year'] == year]

    # Calculate the total number of bookings for each hotel
    total_resort_bookings = df_year[df_year['hotel'] == 'Resort Hotel'].shape[0]
    total_city_bookings = df_year[df_year['hotel'] == 'City Hotel'].shape[0]

    # Calculate the total number of bookings overall for the year
    total_bookings_year = total_resort_bookings + total_city_bookings

    # Compute the percentage of bookings for each hotel
    percentage_resort_bookings = (total_resort_bookings / total_bookings_year) * 100
    percentage_city_bookings = (total_city_bookings / total_bookings_year) * 100

    # Prepare data for the pie chart
    labels = ['Resort Hotel', 'City Hotel']
    sizes = [percentage_resort_bookings, percentage_city_bookings]

    # Plot pie chart for the year
    wedges, texts, autotexts = axes[i].pie(sizes, labels=labels, colors = sns.color_palette('Blues'), autopct='%1.2f%%', startangle=90)
    axes[i].axis('equal')  # Equal aspect ratio ensures the pie is drawn as a circle.
    axes[i].set_title(f'Year {year}')
    for text in texts + autotexts:
        text.set_fontsize(12)

    # Store the percentages in the dictionary
    booking_percentages[year] = {
        'Resort Hotel': percentage_resort_bookings,
        'City Hotel': percentage_city_bookings
    }

# Set the title for the entire figure
fig.suptitle('Percentage of Bookings for Each Hotel by Year', fontsize=16)

# Adjust layout to prevent overlap
plt.tight_layout()
plt.subplots_adjust(top=0.85)

# Display the plot
plt.show()

# Print the percentages after visualisation
print("Percentage of bookings for each year:")
for year in years:
    percentages = booking_percentages[year]
    print(f"Year {year}:")
    print(f"  Resort Hotel: {percentages['Resort Hotel']:.2f}%")
    print(f"  City Hotel: {percentages['City Hotel']:.2f}%")
    print()
No description has been provided for this image
Percentage of bookings for each year:
Year 2015:
  Resort Hotel: 37.80%
  City Hotel: 62.20%

Year 2016:
  Resort Hotel: 32.74%
  City Hotel: 67.26%

Year 2017:
  Resort Hotel: 32.39%
  City Hotel: 67.61%

Analysis:

In the years 2015 to 2017, there were a total of 21,996, 56,707, and 40,687 bookings respectively for both hotels. From the pie chart in Figure 9, we can clearly see the percentage distribution of bookings for each year. The table below shows the detailed percentages:

Hotels Year 2015 2016 2017
Resort Hotel 37.80% 32.74% 32.39%
City Hotel 62.20% 67.26% 67.61%

Therefore, it is visible that City Hotel has a clear dominance, capturing more than 60% of the bookings each year. In contrast, Resort Hotel accounts for less than 40%. This consistent trend indicates a strong preference for City Hotel over Resort Hotel across the three-year period.

2.2.4 Which is the busiest month for hotel?¶

In [ ]:
# Create a datetime column from year, month, and day
df['arrival_date'] = pd.to_datetime(df['arrival_date_year'].astype(str) + '-' +
                                    df['arrival_date_month'].astype(str) + '-' +
                                    df['arrival_date_day_of_month'].astype(str), 
                                    format='%Y-%B-%d')

# Assuming 'total_nights' is the sum of 'stays_in_weekend_nights' and 'stays_in_week_nights'
df['total_nights'] = df['stays_in_weekend_nights'] + df['stays_in_week_nights']

# Function to prepare data for the heatmap
def prepare_heatmap_data(hotel_type):
    # Filter data for the specific hotel type
    hotel_df = df[df['hotel'] == hotel_type]
    
    # Group by 'arrival_date_year' and 'arrival_date_month' and calculate total stays
    monthly_stays = hotel_df.groupby(['arrival_date_year', 'arrival_date_month'])[['total_nights']].sum().reset_index()
    
    # Convert month names to numbers for easier processing
    monthly_stays['arrival_date_month_num'] = pd.to_datetime(monthly_stays['arrival_date_month'], format='%B').dt.month
    
    return monthly_stays

# Function to create a calendar-style heatmap
def create_calendar_heatmap(ax, monthly_stays, hotel_name):
    heatmap_data = monthly_stays.pivot(index='arrival_date_year', columns='arrival_date_month_num', values='total_nights')
    
    # Create the heatmap
    sns.heatmap(heatmap_data, annot=True, fmt=".0f", cmap="Blues", cbar_kws={'label': 'Total Stays'}, linewidths=0.5, linecolor='gray', ax=ax)
    
    # Set the month names as x-axis labels
    ax.set_xticks(np.arange(12) + 0.5)
    ax.set_xticklabels([calendar.month_name[i] for i in range(1, 13)], rotation=45)
    
    # Set the year names as y-axis labels
    ax.set_yticks(np.arange(len(heatmap_data.index)) + 0.5)
    ax.set_yticklabels(heatmap_data.index, rotation=0)
    
    ax.set_xlabel('Month')
    ax.set_ylabel('Year')
    ax.set_title(f'{hotel_name}')

# Prepare data for Resort Hotel
resort_monthly_stays = prepare_heatmap_data('Resort Hotel')

# Prepare data for City Hotel
city_monthly_stays = prepare_heatmap_data('City Hotel')

# Create subplots
fig, axes = plt.subplots(1, 2, figsize=(24, 10))

# Create calendar heatmap for Resort Hotel
create_calendar_heatmap(axes[0], resort_monthly_stays, 'Resort Hotel')

# Create calendar heatmap for City Hotel
create_calendar_heatmap(axes[1], city_monthly_stays, 'City Hotel')

# Adjust layout to prevent overlap
plt.tight_layout()

# Show the plot
plt.show()

# Calculate the average stays per month for each hotel across all years
monthly_stays_combined = df.groupby(['hotel', 'arrival_date_month'])[['total_nights']].sum().reset_index()
monthly_stays_combined['total_years'] = df.groupby(['hotel', 'arrival_date_month'])[['arrival_date_year']].nunique().reset_index(drop=True)
monthly_stays_combined['average_stays'] = monthly_stays_combined['total_nights'] / monthly_stays_combined['total_years']

# Pivot the data to have months as columns and hotels as rows
monthly_stays_pivot = monthly_stays_combined.pivot(index='arrival_date_month', columns='hotel', values='average_stays')

# Reindex the DataFrame to ensure the correct month order
months_order = ['January', 'February', 'March', 'April', 'May', 'June', 
                'July', 'August', 'September', 'October', 'November', 'December']
monthly_stays_pivot = monthly_stays_pivot.reindex(months_order)

# Plot the line graph
plt.figure(figsize=(14, 8))
plt.plot(monthly_stays_pivot.index, monthly_stays_pivot['Resort Hotel'], marker='o', label='Resort Hotel', color='blue')
plt.plot(monthly_stays_pivot.index, monthly_stays_pivot['City Hotel'], marker='o', label='City Hotel', color='red')
plt.title('Average Stays Per Month for Each Hotel')
plt.xlabel('Month')
plt.ylabel('Average Stays')
plt.legend()
plt.grid(True)
plt.xticks(rotation=45)
plt.show()

# Find the busiest month for each hotel
busiest_month_resort = monthly_stays_pivot['Resort Hotel'].idxmax()
busiest_month_city = monthly_stays_pivot['City Hotel'].idxmax()

print(f"The busiest month for Resort Hotel is {busiest_month_resort}.")
print(f"The busiest month for City Hotel is {busiest_month_city}.")
No description has been provided for this image
No description has been provided for this image
The busiest month for Resort Hotel is August.
The busiest month for City Hotel is May.

Analysis:

From the heat map, both hotels showed different colour patterns, Resort Hotel suggested a strong peak during the summer, possibly due to seasonal travel patterns; whereas City Hotel showed a consistent peak booking pattern throughout the year.

To conclude the questions of which is the busiest month for hotel, we are using a line plot to visualise it. In the line plot we can observe the busiest month for City Hotel is in May, and it is August for Resort Hotel. This also indicates that City Hotel attracts guest more evenly across different months compared to Resort Hotel’s concentrated summer demand.

2.2.5 From which country most guest come?¶

In [ ]:
import matplotlib.pyplot as plt
import seaborn as sns
import geopandas as gpd
import pandas as pd
import pycountry

# Function to get country name from country code
def get_country_name(country_code):
    try:
        return pycountry.countries.get(alpha_3=country_code).name
    except AttributeError:
        return None

# Function to plot top 10 countries (bar)
def plot_top_countries(ax, country_counts, hotel_type):
    top_10_countries = country_counts.head(10)
    
    sns.barplot(ax=ax, x='count', y='country_name', data=top_10_countries, palette='inferno')
    ax.set_title(f'Top 10 Countries with the Most Guests ({hotel_type})', fontsize=18)
    ax.set_xlabel('Number of Guests', fontsize=14)
    ax.set_ylabel('Country', fontsize=14)

# Function to plot global map
def plot_global_map(ax, country_counts, hotel_type):
    world = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres'))
    world = world[['iso_a3', 'geometry']]

    # Simplify the geometries to avoid multi-part geometry issues
    world['geometry'] = world['geometry'].apply(lambda x: x.simplify(0.1) if x.geom_type == 'MultiPolygon' else x)

    # Merge the country counts with the world map
    merged = world.set_index('iso_a3').join(country_counts.set_index('country'))

    # Handle countries that are missing in the dataset
    missing_countries = merged[merged['count'].isna()].index

    # Convert missing country codes to country names
    missing_country_names = [get_country_name(code) for code in missing_countries]

    # Fill missing countries with 0 count
    merged['count'] = merged['count'].fillna(0)

    # Define custom bin edges
    bin_edges = [0, 1, 4, 9, 20, 50, 100, 600, 50000]
    labels = [f'{bin_edges[i]} - {bin_edges[i+1]}' for i in range(len(bin_edges)-1)]

    # Assign bin labels to the data
    merged['bin'] = pd.cut(merged['count'], bins=bin_edges, labels=labels, include_lowest=True)

    # Plotting the map with custom bins
    merged.plot(column='bin', cmap='Blues', linewidth=1, ax=ax, edgecolor='0.6', legend=True)

    # Customize the legend position and size
    legend = ax.get_legend()
    legend.set_bbox_to_anchor((0.16, 0.6))  # Adjusted position
    for text in legend.get_texts():
        text.set_fontsize(15)  # Adjust the font size as needed

    # Add axis labels
    ax.set_xlabel('Longitude', fontsize=14)
    ax.set_ylabel('Latitude', fontsize=14)

    ax.set_title(f'Global Distribution of Hotel Guests ({hotel_type})', fontsize=18)

    return missing_country_names

# Process each hotel type separately and print the top 10 list
missing_countries_per_hotel = {}

# Create subplots for bar plots and global maps side by side for each hotel type
fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(25, 20), gridspec_kw={'width_ratios': [1, 2]})

text = """
\n\nBy looking at the **top 10 countries lists** below based on the analysis of hotel booking data, 
    we can observe that **the country that most guests come from for Resort Hotel** is: **Portugal**, having a count of 18094 guests, with **United Kingdom** and **Spain** following closely behind. 
    \nSimilarly for **City Hotel**, **Portugal remains the top source of guests** for the hotel with the highest count of 30984 guests, 
    with **France** and **Germany** also contributing significantly. 
    With this analysis, this indicates that **Portugal and most neighbouring European countries are the primary markets for both hotels**.\n
"""

display(Markdown(text))

for i, hotel_type in enumerate(df['hotel'].unique()):
    hotel_data = df[df['hotel'] == hotel_type]
    
    # Extract and count the occurrences of each country code
    country_counts = hotel_data['country'].value_counts().reset_index()
    country_counts.columns = ['country', 'count']
    country_counts['country_name'] = country_counts['country'].apply(get_country_name)

    print(f"\nTop 10 countries for {hotel_type}: \n")
    print(country_counts.head(10))

    # Determine the country with the most guests
    top_country = country_counts.iloc[0]
    print(f"\nCountry with the most guests for {hotel_type}: {top_country['country_name']} with {top_country['count']} guests\n")

    # Plot top 10 countries
    plot_top_countries(axes[i, 0], country_counts, hotel_type)

    # Plot global map
    missing_countries = plot_global_map(axes[i, 1], country_counts, hotel_type)
    missing_countries_per_hotel[hotel_type] = missing_countries

plt.tight_layout()
plt.show()

# Combined analysis for the total dataset
combined_country_counts = df['country'].value_counts().reset_index()
combined_country_counts.columns = ['country', 'count']
combined_country_counts['country_name'] = combined_country_counts['country'].apply(get_country_name)


text = """
\n\nThe global map visualisation below aids in identifying regions with high guest density, where **darker colour regions represent areas where more guests are coming from.** 
This provides insights into potential markets to help the hotels target into for future marketing campaigns or advertisements in a general and wider scope.

\n\nOn top of that, we also provided an **overall analysis** of the hotel booking data for the **country with most guests coming from.** 
Based on the visualisations and lists below, it appears that **Portugal again emerges as the leading source of guests for both City Hotel and Resort Hotel.** 
The **top 10 countries contributing the most guests** are Portugal, the United Kingdom, France, Spain, Germany, Italy, Ireland, Belgium, the Netherlands, 
and the United States, highlighting that European countries are the primary markets.\n
"""

display(Markdown(text))

print("\nTop 10 countries overall: \n")
print(combined_country_counts.head(10))

# Determine the country with the most guests overall
top_country_overall = combined_country_counts.iloc[0]
print(f"Country with the most guests overall: {top_country_overall['country_name']} with {top_country_overall['count']} guests")

# Create subplots for the overall bar plot and global map side by side
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(25, 10), gridspec_kw={'width_ratios': [1, 2]})

# Plot top 10 countries for the total dataset
plot_top_countries(axes[0], combined_country_counts, "Overall")

# Plot global map for the total dataset
missing_countries_overall = plot_global_map(axes[1], combined_country_counts, "Overall")

plt.tight_layout()
plt.show()

text = """
\n\nBesides, a **list of missing countries** in the dataset is also displayed in the code, 
indicating that these are the countries that are not visualised and analysed in our analysis due to the gaps in data collection.\n
"""

display(Markdown(text))

# Print missing countries for each hotel
for hotel_type, missing_countries in missing_countries_per_hotel.items():
    print(f"\nMissing countries in the dataset for {hotel_type} ({len(missing_countries)}):\n")
    for country in missing_countries:
        print(country)

# Print missing countries for the overall dataset
print(f"\nMissing countries in the overall dataset ({len(missing_countries_overall)}):\n")
for country in missing_countries_overall:
    print(country)

By looking at the top 10 countries lists below based on the analysis of hotel booking data, we can observe that the country that most guests come from for Resort Hotel is: Portugal, having a count of 18094 guests, with United Kingdom and Spain following closely behind.

Similarly for City Hotel, Portugal remains the top source of guests for the hotel with the highest count of 30984 guests, with France and Germany also contributing significantly. With this analysis, this indicates that Portugal and most neighbouring European countries are the primary markets for both hotels.

Top 10 countries for Resort Hotel: 

  country  count    country_name
0     PRT  18094        Portugal
1     GBR   6814  United Kingdom
2     ESP   3957           Spain
3     IRL   2166         Ireland
4     FRA   1611          France
5     DEU   1203         Germany
6      CN    710            None
7     NLD    514     Netherlands
8     USA    479   United States
9     ITA    459           Italy

Country with the most guests for Resort Hotel: Portugal with 18094 guests

C:\Users\Acer\AppData\Local\Temp\ipykernel_14172\160880919.py:18: FutureWarning: 

Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `y` variable to `hue` and set `legend=False` for the same effect.

  sns.barplot(ax=ax, x='count', y='country_name', data=top_10_countries, palette='inferno')
C:\Users\Acer\AppData\Local\Temp\ipykernel_14172\160880919.py:25: FutureWarning: The geopandas.dataset module is deprecated and will be removed in GeoPandas 1.0. You can get the original 'naturalearth_lowres' data from https://www.naturalearthdata.com/downloads/110m-cultural-vectors/.
  world = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres'))
Top 10 countries for City Hotel: 

  country  count    country_name
0     PRT  30984        Portugal
1     FRA   8804          France
2     DEU   6084         Germany
3     GBR   5315  United Kingdom
4     ESP   4611           Spain
5     ITA   3307           Italy
6     BEL   1894         Belgium
7     BRA   1794          Brazil
8     USA   1618   United States
9     NLD   1590     Netherlands

Country with the most guests for City Hotel: Portugal with 30984 guests

C:\Users\Acer\AppData\Local\Temp\ipykernel_14172\160880919.py:18: FutureWarning: 

Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `y` variable to `hue` and set `legend=False` for the same effect.

  sns.barplot(ax=ax, x='count', y='country_name', data=top_10_countries, palette='inferno')
C:\Users\Acer\AppData\Local\Temp\ipykernel_14172\160880919.py:25: FutureWarning: The geopandas.dataset module is deprecated and will be removed in GeoPandas 1.0. You can get the original 'naturalearth_lowres' data from https://www.naturalearthdata.com/downloads/110m-cultural-vectors/.
  world = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres'))
No description has been provided for this image

The global map visualisation below aids in identifying regions with high guest density, where darker colour regions represent areas where more guests are coming from. This provides insights into potential markets to help the hotels target into for future marketing campaigns or advertisements in a general and wider scope.

On top of that, we also provided an overall analysis of the hotel booking data for the country with most guests coming from. Based on the visualisations and lists below, it appears that Portugal again emerges as the leading source of guests for both City Hotel and Resort Hotel. The top 10 countries contributing the most guests are Portugal, the United Kingdom, France, Spain, Germany, Italy, Ireland, Belgium, the Netherlands, and the United States, highlighting that European countries are the primary markets.

Top 10 countries overall: 

  country  count    country_name
0     PRT  49078        Portugal
1     GBR  12129  United Kingdom
2     FRA  10415          France
3     ESP   8568           Spain
4     DEU   7287         Germany
5     ITA   3766           Italy
6     IRL   3375         Ireland
7     BEL   2342         Belgium
8     BRA   2224          Brazil
9     NLD   2104     Netherlands
Country with the most guests overall: Portugal with 49078 guests
C:\Users\Acer\AppData\Local\Temp\ipykernel_14172\160880919.py:18: FutureWarning: 

Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `y` variable to `hue` and set `legend=False` for the same effect.

  sns.barplot(ax=ax, x='count', y='country_name', data=top_10_countries, palette='inferno')
C:\Users\Acer\AppData\Local\Temp\ipykernel_14172\160880919.py:25: FutureWarning: The geopandas.dataset module is deprecated and will be removed in GeoPandas 1.0. You can get the original 'naturalearth_lowres' data from https://www.naturalearthdata.com/downloads/110m-cultural-vectors/.
  world = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres'))
No description has been provided for this image

Besides, a list of missing countries in the dataset is also displayed in the code, indicating that these are the countries that are not visualised and analysed in our analysis due to the gaps in data collection.

Missing countries in the dataset for Resort Hotel (70):

Tanzania, United Republic of
Western Sahara
Canada
Papua New Guinea
Congo, The Democratic Republic of the
Somalia
Kenya
Sudan
Chad
Haiti
Falkland Islands (Malvinas)
Greenland
French Southern Territories
Timor-Leste
Lesotho
Bolivia, Plurinational State of
Panama
Nicaragua
Honduras
El Salvador
Guatemala
Belize
Guyana
Namibia
Mali
Mauritania
Benin
Niger
Ghana
Guinea
Guinea-Bissau
Liberia
Sierra Leone
Burkina Faso
Congo
Gabon
Equatorial Guinea
Eswatini
Palestine, State of
Gambia
Iraq
Vanuatu
Cambodia
Lao People's Democratic Republic
Myanmar
Korea, Democratic People's Republic of
Mongolia
Bangladesh
Bhutan
Afghanistan
Tajikistan
Kyrgyzstan
Turkmenistan
Moldova, Republic of
New Caledonia
Solomon Islands
Brunei Darussalam
Eritrea
Paraguay
Yemen
Antarctica
None
Libya
Ethiopia
None
Rwanda
Montenegro
None
Trinidad and Tobago
South Sudan

Missing countries in the dataset for City Hotel (45):

Fiji
Western Sahara
Canada
Papua New Guinea
Congo, The Democratic Republic of the
Somalia
Chad
Haiti
Bahamas
Falkland Islands (Malvinas)
Greenland
Timor-Leste
Lesotho
Belize
Botswana
Niger
Guinea
Liberia
Congo
Equatorial Guinea
Malawi
Eswatini
Burundi
Madagascar
Palestine, State of
Gambia
Vanuatu
Korea, Democratic People's Republic of
Mongolia
Bhutan
Nepal
Afghanistan
Kyrgyzstan
Turkmenistan
Moldova, Republic of
Solomon Islands
Brunei Darussalam
Eritrea
Yemen
None
Djibouti
None
None
Trinidad and Tobago
South Sudan

Missing countries in the overall dataset (37):

Western Sahara
Canada
Papua New Guinea
Congo, The Democratic Republic of the
Somalia
Chad
Haiti
Falkland Islands (Malvinas)
Greenland
Timor-Leste
Lesotho
Belize
Niger
Guinea
Liberia
Congo
Equatorial Guinea
Eswatini
Palestine, State of
Gambia
Vanuatu
Korea, Democratic People's Republic of
Mongolia
Bhutan
Afghanistan
Kyrgyzstan
Turkmenistan
Moldova, Republic of
Solomon Islands
Brunei Darussalam
Eritrea
Yemen
None
None
None
Trinidad and Tobago
South Sudan

2.2.6: How Long People Stay in the hotel?¶

In [ ]:
import pandas as pd
import matplotlib.pyplot as plt
from tabulate import tabulate

# Filter out cancelled bookings
df_successful = df[df['is_canceled'] == 0].copy()

# Calculate total length of stay (total nights)
df_successful['total_nights'] = df_successful['stays_in_weekend_nights'] + df_successful['stays_in_week_nights']

# Filter data for Resort Hotel and City Hotel
resort_hotel = df_successful[df_successful['hotel'] == 'Resort Hotel']
city_hotel = df_successful[df_successful['hotel'] == 'City Hotel']

# Get the count of bookings for each length of stay
resort_nights_count = resort_hotel['total_nights'].value_counts().sort_index()
city_nights_count = city_hotel['total_nights'].value_counts().sort_index()

# Create DataFrames for plotting
resort_data = pd.DataFrame({
    'No. of nights': resort_nights_count.index,
    'No. of bookings': resort_nights_count.values
})

city_data = pd.DataFrame({
    'No. of nights': city_nights_count.index,
    'No. of bookings': city_nights_count.values
})

# Plot both line plots on a single figure
plt.figure(figsize=(14, 7))
plt.plot(resort_data['No. of nights'], resort_data['No. of bookings'], marker='o', linestyle='-', color='blue', label='Resort Hotel')
plt.plot(city_data['No. of nights'], city_data['No. of bookings'], marker='o', linestyle='-', color='red', label='City Hotel')
plt.title('Number of Bookings for Different Lengths of Stay\n', fontsize = 18)
plt.xlabel('\nNo. of Nights', fontsize = 15)
plt.ylabel('No. of Bookings\n', fontsize = 15)
plt.xticks(fontsize = 14)
plt.yticks(fontsize = 13)
plt.grid()
plt.legend()
plt.show()

print()
text = """
\n\nFrom the line plot above that describes the number of bookings for different lengths of stay, **we observe a significant skew to the left**:
- There is a low variation on the right; the right side of the plot is relatively flat.
- This flatness suggests that the number of bookings does not change significantly beyond a certain length of stay.
- The lack of variation in bookings for longer stays implies that these data points do not provide additional insights into booking patterns.
- Hence, to better understand booking trends, **we focus our analysis on stays of 15 days or fewer, where the data is more informative**.\n
"""

display(Markdown(text))
print()

# Filter for stays of 15 nights or fewer
resort_hotel_15 = resort_hotel[resort_hotel['total_nights'] <= 15]
city_hotel_15 = city_hotel[city_hotel['total_nights'] <= 15]

# Get the count of bookings for each length of stay (≤ 15 nights)
resort_nights_count_15 = resort_hotel_15['total_nights'].value_counts().sort_index()
city_nights_count_15 = city_hotel_15['total_nights'].value_counts().sort_index()

# Create DataFrames for plotting
resort_data_15 = pd.DataFrame({
    'No. of nights': resort_nights_count_15.index,
    'No. of bookings': resort_nights_count_15.values
})

city_data_15 = pd.DataFrame({
    'No. of nights': city_nights_count_15.index,
    'No. of bookings': city_nights_count_15.values
})

# Create a single plot with overlapping lines
fig, ax = plt.subplots(figsize=(14, 10))

ax.plot(resort_data_15['No. of nights'], resort_data_15['No. of bookings'], marker='o', color='blue', label='Resort Hotel')
ax.plot(city_data_15['No. of nights'], city_data_15['No. of bookings'], marker='o', color='red', label='City Hotel')

ax.set_title('Number of Bookings for Different Lengths of Stay (≤ 15 Nights)\n', fontsize = 18)
ax.set_xlabel('\nNo. of Nights', fontsize = 15)
ax.set_ylabel('No. of Bookings\n', fontsize = 15)
plt.xticks(range(16), fontsize = 14)
plt.yticks(fontsize = 13)
ax.grid()
ax.legend()
plt.tight_layout()
plt.show()

print()
text = """
\n\nBelow is the list of number of bookings for each length of stay sorted by number of nights: 
"""

display(Markdown(text))

# Create DataFrames for the number of bookings
df_resort = resort_nights_count.reset_index()
df_resort.columns = ['No. of nights', 'Resort Hotel Bookings']

df_city = city_nights_count.reset_index()
df_city.columns = ['No. of nights', 'City Hotel Bookings']

# Merge DataFrames
df_combined = pd.merge(df_resort, df_city, on='No. of nights', how='outer').fillna(0)

# Convert booking counts to integers
df_combined['Resort Hotel Bookings'] = df_combined['Resort Hotel Bookings'].astype(int)
df_combined['City Hotel Bookings'] = df_combined['City Hotel Bookings'].astype(int)

# Sort by 'No. of nights'
df_combined.sort_values(by='No. of nights', inplace=True)

# Print the combined DataFrame using tabulate
print(tabulate(df_combined, headers='keys', tablefmt='grid', showindex=False))
No description has been provided for this image

From the line plot above that describes the number of bookings for different lengths of stay, we observe a significant skew to the left:

  • There is a low variation on the right; the right side of the plot is relatively flat.
  • This flatness suggests that the number of bookings does not change significantly beyond a certain length of stay.
  • The lack of variation in bookings for longer stays implies that these data points do not provide additional insights into booking patterns.
  • Hence, to better understand booking trends, we focus our analysis on stays of 15 days or fewer, where the data is more informative.

No description has been provided for this image

Below is the list of number of bookings for each length of stay sorted by number of nights:

+-----------------+-------------------------+-----------------------+
|   No. of nights |   Resort Hotel Bookings |   City Hotel Bookings |
+=================+=========================+=======================+
|               0 |                     372 |                   308 |
+-----------------+-------------------------+-----------------------+
|               1 |                    6580 |                  9169 |
+-----------------+-------------------------+-----------------------+
|               2 |                    4488 |                 10992 |
+-----------------+-------------------------+-----------------------+
|               3 |                    3830 |                 11895 |
+-----------------+-------------------------+-----------------------+
|               4 |                    3321 |                  7704 |
+-----------------+-------------------------+-----------------------+
|               5 |                    1900 |                  3221 |
+-----------------+-------------------------+-----------------------+
|               6 |                    1206 |                  1116 |
+-----------------+-------------------------+-----------------------+
|               7 |                    4435 |                  1251 |
+-----------------+-------------------------+-----------------------+
|               8 |                     511 |                   209 |
+-----------------+-------------------------+-----------------------+
|               9 |                     408 |                   120 |
+-----------------+-------------------------+-----------------------+
|              10 |                     700 |                    83 |
+-----------------+-------------------------+-----------------------+
|              11 |                     240 |                    36 |
+-----------------+-------------------------+-----------------------+
|              12 |                      90 |                    35 |
+-----------------+-------------------------+-----------------------+
|              13 |                      75 |                    16 |
+-----------------+-------------------------+-----------------------+
|              14 |                     630 |                    29 |
+-----------------+-------------------------+-----------------------+
|              15 |                      23 |                    16 |
+-----------------+-------------------------+-----------------------+
|              16 |                      12 |                     6 |
+-----------------+-------------------------+-----------------------+
|              17 |                      11 |                     4 |
+-----------------+-------------------------+-----------------------+
|              18 |                       5 |                     1 |
+-----------------+-------------------------+-----------------------+
|              19 |                       4 |                     2 |
+-----------------+-------------------------+-----------------------+
|              20 |                       0 |                     1 |
+-----------------+-------------------------+-----------------------+
|              21 |                      35 |                     1 |
+-----------------+-------------------------+-----------------------+
|              22 |                       7 |                     3 |
+-----------------+-------------------------+-----------------------+
|              23 |                       1 |                     1 |
+-----------------+-------------------------+-----------------------+
|              24 |                       0 |                     1 |
+-----------------+-------------------------+-----------------------+
|              25 |                      14 |                     0 |
+-----------------+-------------------------+-----------------------+
|              27 |                       0 |                     1 |
+-----------------+-------------------------+-----------------------+
|              28 |                      22 |                     1 |
+-----------------+-------------------------+-----------------------+
|              29 |                       2 |                     1 |
+-----------------+-------------------------+-----------------------+
|              30 |                       2 |                     0 |
+-----------------+-------------------------+-----------------------+
|              34 |                       0 |                     1 |
+-----------------+-------------------------+-----------------------+
|              35 |                       5 |                     0 |
+-----------------+-------------------------+-----------------------+
|              38 |                       1 |                     0 |
+-----------------+-------------------------+-----------------------+
|              42 |                       3 |                     0 |
+-----------------+-------------------------+-----------------------+
|              43 |                       0 |                     1 |
+-----------------+-------------------------+-----------------------+
|              45 |                       1 |                     0 |
+-----------------+-------------------------+-----------------------+
|              46 |                       1 |                     0 |
+-----------------+-------------------------+-----------------------+
|              48 |                       0 |                     1 |
+-----------------+-------------------------+-----------------------+
|              49 |                       0 |                     1 |
+-----------------+-------------------------+-----------------------+
|              56 |                       1 |                     0 |
+-----------------+-------------------------+-----------------------+
|              57 |                       0 |                     1 |
+-----------------+-------------------------+-----------------------+
|              60 |                       1 |                     0 |
+-----------------+-------------------------+-----------------------+
|              69 |                       1 |                     0 |
+-----------------+-------------------------+-----------------------+

Analysis:

The analysis of the length of stay at the two types of hotels reveals distinct patterns in booking behaviour. From the line plot showing the number of bookings for different lengths of stay (up to 15 nights), it is evident that guests at City Hotel tend to book shorter stays compared to those at Resort Hotel. The peak for City Hotel (red line) occurs at 3 nights, with nearly 12,000 bookings, followed closely by 2-night stays with around 11,000 bookings. On the other hand, the peak for Resort Hotel (blue line) is at 1 night, with approximately 6,500 bookings. Beyond these durations of 1 to 3 nights, the number of bookings drops significantly, indicating that guests lose interest in longer stays at both types of hotels.

We can also conclude that 3 nights is the ideal duration for people to travel in a city, leading them to choose City Hotels for such a stay. In contrast, the ideal duration for people to stay in a resort is 1 night, likely influenced by the higher cost of Resort Hotels compared to City Hotels. Although this price difference is subjective, it explains why most people opt for a 1-night stay in Resort Hotels. However, if guests have a larger budget, they are more likely to choose a week-long stay in a Resort Hotel.

2.2.7 Which was the most booked accommodation type (Single, Couple, Family)?¶

In [ ]:
# Define the accommodation_type function
def accommodation_type(row):
    if row['adults'] == 1 and row['children'] == 0 and row['babies'] == 0:
        return 'Single'
    elif row['adults'] == 2 and row['children'] == 0 and row['babies'] == 0:
        return 'Couple'
    else:
        return 'Family'

# Create the 'accommodation_type' column
df['accommodation_type'] = df.apply(accommodation_type, axis=1)

# Separate data for City Hotel and Resort Hotel
city_hotel = df[df['hotel'] == 'City Hotel']
resort_hotel = df[df['hotel'] == 'Resort Hotel']

# Calculate the most booked accommodation type for City Hotel
city_accommodation_counts = city_hotel['accommodation_type'].value_counts()
most_booked_city_type = city_accommodation_counts.idxmax()

# Calculate the most booked accommodation type for Resort Hotel
resort_accommodation_counts = resort_hotel['accommodation_type'].value_counts()
most_booked_resort_type = resort_accommodation_counts.idxmax()

# Overall hotel
overall_booked_hotel = city_accommodation_counts.add(resort_accommodation_counts, fill_value=0)
most_booked_hotel_type = overall_booked_hotel.idxmax()

# Data visualization
f, axes = plt.subplots(1, 2, figsize=(24, 12))

# Plotting the overall bar chart
ax0 = sns.barplot(x=overall_booked_hotel.index, y=overall_booked_hotel.values, palette='Blues', ax=axes[0])
axes[0].set_xlabel('Accommodation Type', fontsize = 18)
axes[0].set_ylabel('Number of Bookings', fontsize = 18)
axes[0].set_title('Number of Bookings by Accommodation Type', fontsize = 18)

# Add percentage on top of each bar for the overall bookings
total_overall = overall_booked_hotel.sum()
for p in ax0.patches:
    percentage = f'{(p.get_height() / total_overall) * 100:.2f}%'
    ax0.annotate(f'{percentage}', (p.get_x() + p.get_width() / 2., p.get_height()),
                 ha='center', va='baseline', fontsize=12, color='black', xytext=(0, 5), 
                 textcoords='offset points')

# Combine the counts into a single DataFrame for plotting
accommodation_counts = pd.DataFrame({
    'Resort Hotel': resort_accommodation_counts,
    'City Hotel': city_accommodation_counts
}).fillna(0)

# Calculate overall percentages for each accommodation type
overall_percentages = (overall_booked_hotel / total_overall) * 100

# Plotting the grouped bar chart
ax1 = accommodation_counts.plot(kind='bar', color = sns.color_palette('Blues', n_colors=len(accommodation_counts.columns)), ax=axes[1])
axes[1].set_xlabel('Accommodation Type', fontsize = 18)
axes[1].set_ylabel('Number of Bookings', fontsize = 18)
axes[1].set_title('Number of Bookings by Accommodation Type and Hotel', fontsize = 18)
axes[1].set_xticks(range(len(accommodation_counts.index)))
axes[1].set_xticklabels(accommodation_counts.index, rotation=0)
axes[1].legend(title='Hotel', fontsize = 18)
axes[1].grid(axis='y', linestyle='--', alpha=0.7)

# Add percentage on top of each bar for grouped bar chart
for i, bar in enumerate(ax1.patches):
    # Determine which hotel the bar belongs to
    hotel = 'Resort Hotel' if i % 2 == 0 else 'City Hotel'
    
    # Get the count for the current bar
    count = bar.get_height()
    
    # Determine the total for the combined bars of the same accommodation type
    accommodation_type = accommodation_counts.index[int(i / 2)]
    combined_total = overall_booked_hotel.loc[accommodation_type]
    
    # Calculate the percentage for the current bar relative to the combined total
    percentage = (count / combined_total) * overall_percentages[accommodation_type]
    
    # Annotate the bar with the percentage
    ax1.annotate(f'{percentage:.2f}%', (bar.get_x() + bar.get_width() / 2., bar.get_height()),
                 ha='center', va='baseline', fontsize=12, color='black', xytext=(0, 5),
                 textcoords='offset points')

# Adjust layout to prevent overlap
plt.tight_layout()

# Show the plot
plt.show()

#print result
print(f'Most Booked Accommodation Type: {most_booked_hotel_type}')
print(f'Most Booked Accommodation Type for City Hotel: {most_booked_city_type}')
print(f'Most Booked Accommodation Type for Resort Hotel: {most_booked_resort_type}')
C:\Users\Acer\AppData\Local\Temp\ipykernel_14172\680038083.py:33: FutureWarning: 

Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `x` variable to `hue` and set `legend=False` for the same effect.

  ax0 = sns.barplot(x=overall_booked_hotel.index, y=overall_booked_hotel.values, palette='Blues', ax=axes[0])
No description has been provided for this image
Most Booked Accommodation Type: Couple
Most Booked Accommodation Type for City Hotel: Couple
Most Booked Accommodation Type for Resort Hotel: Couple

Analysis:

The left chart clearly illustrates the total number of bookings for each accommodation type without differentiating between hotel categories. From the graphs, we can analyse the Couples dominate with 68.31% of the total bookings, as indicated by the left most bar. This percentage is much higher than the other categories, suggesting most people prefer accommodations suited for couples. In contrast, the Singles type for 18.91% of the bookings represented by a middle bar, while the Families type makes up just 12.78% shown by the smallest right most bar. This chart conclusively shows that the Couple accommodation type is the most popular choice among all the options.

The right chart provides a detailed breakdown of the bookings by accommodation type and hotel type, distinguishing between Resort Hotels and City Hotels. Even with this division, the preference for the Couple category remains evident. For City Hotels, Couples make up 44.83% of the bookings, and for Resort Hotels, they have 23.83%. This combined total significantly surpasses the percentages for both Singles and Families in either hotel type. Single has 13.04% of bookings in City Hotels and 5.87% in Resort Hotels. Families trail further behind, with 8.51% in City Hotels and 4.20% in Resort Hotels. These numbers conclude that Couples are the most frequent guests across both hotel types.

In conclusion, the Couple accommodation type is the most booked. As we analyse, whether we look at the overall booking trends or the specific preferences in different types of hotels, Couples emerge as the most popular accommodation type.

3.0 Data Pre-processing¶

In [ ]:
import pandas as pd
from sklearn.preprocessing import LabelEncoder
import seaborn as sns
import matplotlib.pyplot as plt

3.1: Feature Engineering¶

Create new columns: total_guests

In [ ]:
# Calculate total_guests
df['total_guests'] = df['adults'] + df['children'] + df['babies']

# Display the first few rows to verify the new columns
df[['adults', 'children', 'babies', 'total_guests']]
Out[ ]:
adults children babies total_guests
0 2 0 0 2
1 2 0 0 2
2 1 0 0 1
3 1 0 0 1
4 2 0 0 2
... ... ... ... ...
119385 2 0 0 2
119386 3 0 0 3
119387 2 0 0 2
119388 2 0 0 2
119389 2 0 0 2

119390 rows × 4 columns

It can be observed that the values in the new column are exactly the sum of values from the combined columns, thus the column ‘total_guests’ with new features is successfully created, helping us to reduce unnecessary and overlapping features in the dataset

Drop the unnecessary features¶

We will also delete the combined and unnecesarry features in our dataset by using the ‘drop’ function

In [ ]:
columns_to_drop = ['stays_in_weekend_nights', 'stays_in_week_nights', 'adults', 'children', 'babies', 'arrival_date', 'accommodation_type']
df.drop(columns=columns_to_drop, inplace=True)

df.head()
Out[ ]:
hotel is_canceled lead_time arrival_date_year arrival_date_month arrival_date_week_number arrival_date_day_of_month meal country market_segment ... deposit_type days_in_waiting_list customer_type adr required_car_parking_spaces total_of_special_requests reservation_status reservation_status_date total_nights total_guests
0 Resort Hotel 0 342 2015 July 27 1 BB PRT Direct ... No Deposit 0 Transient 0.0 0 0 Check-Out 2015-07-01 0 2
1 Resort Hotel 0 737 2015 July 27 1 BB PRT Direct ... No Deposit 0 Transient 0.0 0 0 Check-Out 2015-07-01 0 2
2 Resort Hotel 0 7 2015 July 27 1 BB GBR Direct ... No Deposit 0 Transient 75.0 0 0 Check-Out 2015-07-02 1 1
3 Resort Hotel 0 13 2015 July 27 1 BB GBR Corporate ... No Deposit 0 Transient 75.0 0 0 Check-Out 2015-07-02 1 1
4 Resort Hotel 0 14 2015 July 27 1 BB GBR Online TA ... No Deposit 0 Transient 98.0 0 1 Check-Out 2015-07-03 2 2

5 rows × 27 columns

3.2 Data transformation¶

Identify and display categorical columns before transformation

In [ ]:
# Identify categorical columns
categorical_features = df.select_dtypes(include=['object', 'datetime64[ns]']).columns.tolist()

# Display categorical columns and their unique values count
categorical_summary = {col: df[col].nunique() for col in categorical_features}
categorical_summary
Out[ ]:
{'hotel': 2,
 'arrival_date_month': 12,
 'meal': 5,
 'country': 177,
 'market_segment': 8,
 'distribution_channel': 5,
 'reserved_room_type': 10,
 'assigned_room_type': 12,
 'deposit_type': 3,
 'customer_type': 4,
 'reservation_status': 3,
 'reservation_status_date': 926}

Transform categorical features to numerical values using label encoding

In [ ]:
label_encoders = {}
for col in categorical_features:
    le = LabelEncoder()
    df[col] = le.fit_transform(df[col])
    label_encoders[col] = le

    # Print the mapping of categorical values to numerical values
    print(f"Mapping for column '{col}':")
    for class_index, class_label in enumerate(le.classes_):
        print(f"{class_label}: {class_index}")
    print("\n")

# Display the first few rows of the transformed data
df.head()
Mapping for column 'hotel':
City Hotel: 0
Resort Hotel: 1


Mapping for column 'arrival_date_month':
April: 0
August: 1
December: 2
February: 3
January: 4
July: 5
June: 6
March: 7
May: 8
November: 9
October: 10
September: 11


Mapping for column 'meal':
BB: 0
FB: 1
HB: 2
SC: 3
Undefined: 4


Mapping for column 'country':
ABW: 0
AGO: 1
AIA: 2
ALB: 3
AND: 4
ARE: 5
ARG: 6
ARM: 7
ASM: 8
ATA: 9
ATF: 10
AUS: 11
AUT: 12
AZE: 13
BDI: 14
BEL: 15
BEN: 16
BFA: 17
BGD: 18
BGR: 19
BHR: 20
BHS: 21
BIH: 22
BLR: 23
BOL: 24
BRA: 25
BRB: 26
BWA: 27
CAF: 28
CHE: 29
CHL: 30
CHN: 31
CIV: 32
CMR: 33
CN: 34
COL: 35
COM: 36
CPV: 37
CRI: 38
CUB: 39
CYM: 40
CYP: 41
CZE: 42
DEU: 43
DJI: 44
DMA: 45
DNK: 46
DOM: 47
DZA: 48
ECU: 49
EGY: 50
ESP: 51
EST: 52
ETH: 53
FIN: 54
FJI: 55
FRA: 56
FRO: 57
GAB: 58
GBR: 59
GEO: 60
GGY: 61
GHA: 62
GIB: 63
GLP: 64
GNB: 65
GRC: 66
GTM: 67
GUY: 68
HKG: 69
HND: 70
HRV: 71
HUN: 72
IDN: 73
IMN: 74
IND: 75
IRL: 76
IRN: 77
IRQ: 78
ISL: 79
ISR: 80
ITA: 81
JAM: 82
JEY: 83
JOR: 84
JPN: 85
KAZ: 86
KEN: 87
KHM: 88
KIR: 89
KNA: 90
KOR: 91
KWT: 92
LAO: 93
LBN: 94
LBY: 95
LCA: 96
LIE: 97
LKA: 98
LTU: 99
LUX: 100
LVA: 101
MAC: 102
MAR: 103
MCO: 104
MDG: 105
MDV: 106
MEX: 107
MKD: 108
MLI: 109
MLT: 110
MMR: 111
MNE: 112
MOZ: 113
MRT: 114
MUS: 115
MWI: 116
MYS: 117
MYT: 118
NAM: 119
NCL: 120
NGA: 121
NIC: 122
NLD: 123
NOR: 124
NPL: 125
NZL: 126
OMN: 127
PAK: 128
PAN: 129
PER: 130
PHL: 131
PLW: 132
POL: 133
PRI: 134
PRT: 135
PRY: 136
PYF: 137
QAT: 138
ROU: 139
RUS: 140
RWA: 141
SAU: 142
SDN: 143
SEN: 144
SGP: 145
SLE: 146
SLV: 147
SMR: 148
SRB: 149
STP: 150
SUR: 151
SVK: 152
SVN: 153
SWE: 154
SYC: 155
SYR: 156
TGO: 157
THA: 158
TJK: 159
TMP: 160
TUN: 161
TUR: 162
TWN: 163
TZA: 164
UGA: 165
UKR: 166
UMI: 167
URY: 168
USA: 169
UZB: 170
VEN: 171
VGB: 172
VNM: 173
ZAF: 174
ZMB: 175
ZWE: 176


Mapping for column 'market_segment':
Aviation: 0
Complementary: 1
Corporate: 2
Direct: 3
Groups: 4
Offline TA/TO: 5
Online TA: 6
Undefined: 7


Mapping for column 'distribution_channel':
Corporate: 0
Direct: 1
GDS: 2
TA/TO: 3
Undefined: 4


Mapping for column 'reserved_room_type':
A: 0
B: 1
C: 2
D: 3
E: 4
F: 5
G: 6
H: 7
L: 8
P: 9


Mapping for column 'assigned_room_type':
A: 0
B: 1
C: 2
D: 3
E: 4
F: 5
G: 6
H: 7
I: 8
K: 9
L: 10
P: 11


Mapping for column 'deposit_type':
No Deposit: 0
Non Refund: 1
Refundable: 2


Mapping for column 'customer_type':
Contract: 0
Group: 1
Transient: 2
Transient-Party: 3


Mapping for column 'reservation_status':
Canceled: 0
Check-Out: 1
No-Show: 2


Mapping for column 'reservation_status_date':
2014-10-17T00:00:00.000000000: 0
2014-11-18T00:00:00.000000000: 1
2015-01-01T00:00:00.000000000: 2
2015-01-02T00:00:00.000000000: 3
2015-01-18T00:00:00.000000000: 4
2015-01-20T00:00:00.000000000: 5
2015-01-21T00:00:00.000000000: 6
2015-01-22T00:00:00.000000000: 7
2015-01-28T00:00:00.000000000: 8
2015-01-29T00:00:00.000000000: 9
2015-01-30T00:00:00.000000000: 10
2015-02-02T00:00:00.000000000: 11
2015-02-05T00:00:00.000000000: 12
2015-02-06T00:00:00.000000000: 13
2015-02-09T00:00:00.000000000: 14
2015-02-10T00:00:00.000000000: 15
2015-02-11T00:00:00.000000000: 16
2015-02-12T00:00:00.000000000: 17
2015-02-17T00:00:00.000000000: 18
2015-02-19T00:00:00.000000000: 19
2015-02-20T00:00:00.000000000: 20
2015-02-23T00:00:00.000000000: 21
2015-02-24T00:00:00.000000000: 22
2015-02-25T00:00:00.000000000: 23
2015-02-26T00:00:00.000000000: 24
2015-02-27T00:00:00.000000000: 25
2015-03-03T00:00:00.000000000: 26
2015-03-04T00:00:00.000000000: 27
2015-03-05T00:00:00.000000000: 28
2015-03-06T00:00:00.000000000: 29
2015-03-09T00:00:00.000000000: 30
2015-03-10T00:00:00.000000000: 31
2015-03-11T00:00:00.000000000: 32
2015-03-12T00:00:00.000000000: 33
2015-03-13T00:00:00.000000000: 34
2015-03-17T00:00:00.000000000: 35
2015-03-18T00:00:00.000000000: 36
2015-03-23T00:00:00.000000000: 37
2015-03-24T00:00:00.000000000: 38
2015-03-25T00:00:00.000000000: 39
2015-03-28T00:00:00.000000000: 40
2015-03-29T00:00:00.000000000: 41
2015-03-30T00:00:00.000000000: 42
2015-03-31T00:00:00.000000000: 43
2015-04-02T00:00:00.000000000: 44
2015-04-03T00:00:00.000000000: 45
2015-04-04T00:00:00.000000000: 46
2015-04-05T00:00:00.000000000: 47
2015-04-06T00:00:00.000000000: 48
2015-04-07T00:00:00.000000000: 49
2015-04-08T00:00:00.000000000: 50
2015-04-10T00:00:00.000000000: 51
2015-04-11T00:00:00.000000000: 52
2015-04-13T00:00:00.000000000: 53
2015-04-14T00:00:00.000000000: 54
2015-04-15T00:00:00.000000000: 55
2015-04-16T00:00:00.000000000: 56
2015-04-17T00:00:00.000000000: 57
2015-04-18T00:00:00.000000000: 58
2015-04-20T00:00:00.000000000: 59
2015-04-21T00:00:00.000000000: 60
2015-04-22T00:00:00.000000000: 61
2015-04-23T00:00:00.000000000: 62
2015-04-24T00:00:00.000000000: 63
2015-04-25T00:00:00.000000000: 64
2015-04-27T00:00:00.000000000: 65
2015-04-28T00:00:00.000000000: 66
2015-04-29T00:00:00.000000000: 67
2015-04-30T00:00:00.000000000: 68
2015-05-01T00:00:00.000000000: 69
2015-05-04T00:00:00.000000000: 70
2015-05-05T00:00:00.000000000: 71
2015-05-06T00:00:00.000000000: 72
2015-05-07T00:00:00.000000000: 73
2015-05-08T00:00:00.000000000: 74
2015-05-09T00:00:00.000000000: 75
2015-05-11T00:00:00.000000000: 76
2015-05-12T00:00:00.000000000: 77
2015-05-13T00:00:00.000000000: 78
2015-05-14T00:00:00.000000000: 79
2015-05-15T00:00:00.000000000: 80
2015-05-16T00:00:00.000000000: 81
2015-05-18T00:00:00.000000000: 82
2015-05-19T00:00:00.000000000: 83
2015-05-20T00:00:00.000000000: 84
2015-05-21T00:00:00.000000000: 85
2015-05-22T00:00:00.000000000: 86
2015-05-23T00:00:00.000000000: 87
2015-05-25T00:00:00.000000000: 88
2015-05-26T00:00:00.000000000: 89
2015-05-27T00:00:00.000000000: 90
2015-05-28T00:00:00.000000000: 91
2015-05-29T00:00:00.000000000: 92
2015-05-30T00:00:00.000000000: 93
2015-06-01T00:00:00.000000000: 94
2015-06-02T00:00:00.000000000: 95
2015-06-03T00:00:00.000000000: 96
2015-06-04T00:00:00.000000000: 97
2015-06-05T00:00:00.000000000: 98
2015-06-06T00:00:00.000000000: 99
2015-06-08T00:00:00.000000000: 100
2015-06-09T00:00:00.000000000: 101
2015-06-10T00:00:00.000000000: 102
2015-06-11T00:00:00.000000000: 103
2015-06-12T00:00:00.000000000: 104
2015-06-13T00:00:00.000000000: 105
2015-06-14T00:00:00.000000000: 106
2015-06-15T00:00:00.000000000: 107
2015-06-16T00:00:00.000000000: 108
2015-06-17T00:00:00.000000000: 109
2015-06-18T00:00:00.000000000: 110
2015-06-19T00:00:00.000000000: 111
2015-06-20T00:00:00.000000000: 112
2015-06-22T00:00:00.000000000: 113
2015-06-23T00:00:00.000000000: 114
2015-06-24T00:00:00.000000000: 115
2015-06-25T00:00:00.000000000: 116
2015-06-26T00:00:00.000000000: 117
2015-06-27T00:00:00.000000000: 118
2015-06-29T00:00:00.000000000: 119
2015-06-30T00:00:00.000000000: 120
2015-07-01T00:00:00.000000000: 121
2015-07-02T00:00:00.000000000: 122
2015-07-03T00:00:00.000000000: 123
2015-07-04T00:00:00.000000000: 124
2015-07-05T00:00:00.000000000: 125
2015-07-06T00:00:00.000000000: 126
2015-07-07T00:00:00.000000000: 127
2015-07-08T00:00:00.000000000: 128
2015-07-09T00:00:00.000000000: 129
2015-07-10T00:00:00.000000000: 130
2015-07-11T00:00:00.000000000: 131
2015-07-12T00:00:00.000000000: 132
2015-07-13T00:00:00.000000000: 133
2015-07-14T00:00:00.000000000: 134
2015-07-15T00:00:00.000000000: 135
2015-07-16T00:00:00.000000000: 136
2015-07-17T00:00:00.000000000: 137
2015-07-18T00:00:00.000000000: 138
2015-07-19T00:00:00.000000000: 139
2015-07-20T00:00:00.000000000: 140
2015-07-21T00:00:00.000000000: 141
2015-07-22T00:00:00.000000000: 142
2015-07-23T00:00:00.000000000: 143
2015-07-24T00:00:00.000000000: 144
2015-07-25T00:00:00.000000000: 145
2015-07-26T00:00:00.000000000: 146
2015-07-27T00:00:00.000000000: 147
2015-07-28T00:00:00.000000000: 148
2015-07-29T00:00:00.000000000: 149
2015-07-30T00:00:00.000000000: 150
2015-07-31T00:00:00.000000000: 151
2015-08-01T00:00:00.000000000: 152
2015-08-02T00:00:00.000000000: 153
2015-08-03T00:00:00.000000000: 154
2015-08-04T00:00:00.000000000: 155
2015-08-05T00:00:00.000000000: 156
2015-08-06T00:00:00.000000000: 157
2015-08-07T00:00:00.000000000: 158
2015-08-08T00:00:00.000000000: 159
2015-08-09T00:00:00.000000000: 160
2015-08-10T00:00:00.000000000: 161
2015-08-11T00:00:00.000000000: 162
2015-08-12T00:00:00.000000000: 163
2015-08-13T00:00:00.000000000: 164
2015-08-14T00:00:00.000000000: 165
2015-08-15T00:00:00.000000000: 166
2015-08-16T00:00:00.000000000: 167
2015-08-17T00:00:00.000000000: 168
2015-08-18T00:00:00.000000000: 169
2015-08-19T00:00:00.000000000: 170
2015-08-20T00:00:00.000000000: 171
2015-08-21T00:00:00.000000000: 172
2015-08-22T00:00:00.000000000: 173
2015-08-23T00:00:00.000000000: 174
2015-08-24T00:00:00.000000000: 175
2015-08-25T00:00:00.000000000: 176
2015-08-26T00:00:00.000000000: 177
2015-08-27T00:00:00.000000000: 178
2015-08-28T00:00:00.000000000: 179
2015-08-29T00:00:00.000000000: 180
2015-08-30T00:00:00.000000000: 181
2015-08-31T00:00:00.000000000: 182
2015-09-01T00:00:00.000000000: 183
2015-09-02T00:00:00.000000000: 184
2015-09-03T00:00:00.000000000: 185
2015-09-04T00:00:00.000000000: 186
2015-09-05T00:00:00.000000000: 187
2015-09-06T00:00:00.000000000: 188
2015-09-07T00:00:00.000000000: 189
2015-09-08T00:00:00.000000000: 190
2015-09-09T00:00:00.000000000: 191
2015-09-10T00:00:00.000000000: 192
2015-09-11T00:00:00.000000000: 193
2015-09-12T00:00:00.000000000: 194
2015-09-13T00:00:00.000000000: 195
2015-09-14T00:00:00.000000000: 196
2015-09-15T00:00:00.000000000: 197
2015-09-16T00:00:00.000000000: 198
2015-09-17T00:00:00.000000000: 199
2015-09-18T00:00:00.000000000: 200
2015-09-19T00:00:00.000000000: 201
2015-09-20T00:00:00.000000000: 202
2015-09-21T00:00:00.000000000: 203
2015-09-22T00:00:00.000000000: 204
2015-09-23T00:00:00.000000000: 205
2015-09-24T00:00:00.000000000: 206
2015-09-25T00:00:00.000000000: 207
2015-09-26T00:00:00.000000000: 208
2015-09-27T00:00:00.000000000: 209
2015-09-28T00:00:00.000000000: 210
2015-09-29T00:00:00.000000000: 211
2015-09-30T00:00:00.000000000: 212
2015-10-01T00:00:00.000000000: 213
2015-10-02T00:00:00.000000000: 214
2015-10-03T00:00:00.000000000: 215
2015-10-04T00:00:00.000000000: 216
2015-10-05T00:00:00.000000000: 217
2015-10-06T00:00:00.000000000: 218
2015-10-07T00:00:00.000000000: 219
2015-10-08T00:00:00.000000000: 220
2015-10-09T00:00:00.000000000: 221
2015-10-10T00:00:00.000000000: 222
2015-10-11T00:00:00.000000000: 223
2015-10-12T00:00:00.000000000: 224
2015-10-13T00:00:00.000000000: 225
2015-10-14T00:00:00.000000000: 226
2015-10-15T00:00:00.000000000: 227
2015-10-16T00:00:00.000000000: 228
2015-10-17T00:00:00.000000000: 229
2015-10-18T00:00:00.000000000: 230
2015-10-19T00:00:00.000000000: 231
2015-10-20T00:00:00.000000000: 232
2015-10-21T00:00:00.000000000: 233
2015-10-22T00:00:00.000000000: 234
2015-10-23T00:00:00.000000000: 235
2015-10-24T00:00:00.000000000: 236
2015-10-25T00:00:00.000000000: 237
2015-10-26T00:00:00.000000000: 238
2015-10-27T00:00:00.000000000: 239
2015-10-28T00:00:00.000000000: 240
2015-10-29T00:00:00.000000000: 241
2015-10-30T00:00:00.000000000: 242
2015-10-31T00:00:00.000000000: 243
2015-11-01T00:00:00.000000000: 244
2015-11-02T00:00:00.000000000: 245
2015-11-03T00:00:00.000000000: 246
2015-11-04T00:00:00.000000000: 247
2015-11-05T00:00:00.000000000: 248
2015-11-06T00:00:00.000000000: 249
2015-11-07T00:00:00.000000000: 250
2015-11-08T00:00:00.000000000: 251
2015-11-09T00:00:00.000000000: 252
2015-11-10T00:00:00.000000000: 253
2015-11-11T00:00:00.000000000: 254
2015-11-12T00:00:00.000000000: 255
2015-11-13T00:00:00.000000000: 256
2015-11-14T00:00:00.000000000: 257
2015-11-15T00:00:00.000000000: 258
2015-11-16T00:00:00.000000000: 259
2015-11-17T00:00:00.000000000: 260
2015-11-18T00:00:00.000000000: 261
2015-11-19T00:00:00.000000000: 262
2015-11-20T00:00:00.000000000: 263
2015-11-21T00:00:00.000000000: 264
2015-11-22T00:00:00.000000000: 265
2015-11-23T00:00:00.000000000: 266
2015-11-24T00:00:00.000000000: 267
2015-11-25T00:00:00.000000000: 268
2015-11-26T00:00:00.000000000: 269
2015-11-27T00:00:00.000000000: 270
2015-11-28T00:00:00.000000000: 271
2015-11-29T00:00:00.000000000: 272
2015-11-30T00:00:00.000000000: 273
2015-12-01T00:00:00.000000000: 274
2015-12-02T00:00:00.000000000: 275
2015-12-03T00:00:00.000000000: 276
2015-12-04T00:00:00.000000000: 277
2015-12-05T00:00:00.000000000: 278
2015-12-06T00:00:00.000000000: 279
2015-12-07T00:00:00.000000000: 280
2015-12-08T00:00:00.000000000: 281
2015-12-09T00:00:00.000000000: 282
2015-12-10T00:00:00.000000000: 283
2015-12-11T00:00:00.000000000: 284
2015-12-12T00:00:00.000000000: 285
2015-12-13T00:00:00.000000000: 286
2015-12-14T00:00:00.000000000: 287
2015-12-15T00:00:00.000000000: 288
2015-12-16T00:00:00.000000000: 289
2015-12-17T00:00:00.000000000: 290
2015-12-18T00:00:00.000000000: 291
2015-12-19T00:00:00.000000000: 292
2015-12-20T00:00:00.000000000: 293
2015-12-21T00:00:00.000000000: 294
2015-12-22T00:00:00.000000000: 295
2015-12-23T00:00:00.000000000: 296
2015-12-24T00:00:00.000000000: 297
2015-12-25T00:00:00.000000000: 298
2015-12-26T00:00:00.000000000: 299
2015-12-27T00:00:00.000000000: 300
2015-12-28T00:00:00.000000000: 301
2015-12-29T00:00:00.000000000: 302
2015-12-30T00:00:00.000000000: 303
2015-12-31T00:00:00.000000000: 304
2016-01-01T00:00:00.000000000: 305
2016-01-02T00:00:00.000000000: 306
2016-01-03T00:00:00.000000000: 307
2016-01-04T00:00:00.000000000: 308
2016-01-05T00:00:00.000000000: 309
2016-01-06T00:00:00.000000000: 310
2016-01-07T00:00:00.000000000: 311
2016-01-08T00:00:00.000000000: 312
2016-01-09T00:00:00.000000000: 313
2016-01-10T00:00:00.000000000: 314
2016-01-11T00:00:00.000000000: 315
2016-01-12T00:00:00.000000000: 316
2016-01-13T00:00:00.000000000: 317
2016-01-14T00:00:00.000000000: 318
2016-01-15T00:00:00.000000000: 319
2016-01-16T00:00:00.000000000: 320
2016-01-17T00:00:00.000000000: 321
2016-01-18T00:00:00.000000000: 322
2016-01-19T00:00:00.000000000: 323
2016-01-20T00:00:00.000000000: 324
2016-01-21T00:00:00.000000000: 325
2016-01-22T00:00:00.000000000: 326
2016-01-23T00:00:00.000000000: 327
2016-01-24T00:00:00.000000000: 328
2016-01-25T00:00:00.000000000: 329
2016-01-26T00:00:00.000000000: 330
2016-01-27T00:00:00.000000000: 331
2016-01-28T00:00:00.000000000: 332
2016-01-29T00:00:00.000000000: 333
2016-01-30T00:00:00.000000000: 334
2016-01-31T00:00:00.000000000: 335
2016-02-01T00:00:00.000000000: 336
2016-02-02T00:00:00.000000000: 337
2016-02-03T00:00:00.000000000: 338
2016-02-04T00:00:00.000000000: 339
2016-02-05T00:00:00.000000000: 340
2016-02-06T00:00:00.000000000: 341
2016-02-07T00:00:00.000000000: 342
2016-02-08T00:00:00.000000000: 343
2016-02-09T00:00:00.000000000: 344
2016-02-10T00:00:00.000000000: 345
2016-02-11T00:00:00.000000000: 346
2016-02-12T00:00:00.000000000: 347
2016-02-13T00:00:00.000000000: 348
2016-02-14T00:00:00.000000000: 349
2016-02-15T00:00:00.000000000: 350
2016-02-16T00:00:00.000000000: 351
2016-02-17T00:00:00.000000000: 352
2016-02-18T00:00:00.000000000: 353
2016-02-19T00:00:00.000000000: 354
2016-02-20T00:00:00.000000000: 355
2016-02-21T00:00:00.000000000: 356
2016-02-22T00:00:00.000000000: 357
2016-02-23T00:00:00.000000000: 358
2016-02-24T00:00:00.000000000: 359
2016-02-25T00:00:00.000000000: 360
2016-02-26T00:00:00.000000000: 361
2016-02-27T00:00:00.000000000: 362
2016-02-28T00:00:00.000000000: 363
2016-02-29T00:00:00.000000000: 364
2016-03-01T00:00:00.000000000: 365
2016-03-02T00:00:00.000000000: 366
2016-03-03T00:00:00.000000000: 367
2016-03-04T00:00:00.000000000: 368
2016-03-05T00:00:00.000000000: 369
2016-03-06T00:00:00.000000000: 370
2016-03-07T00:00:00.000000000: 371
2016-03-08T00:00:00.000000000: 372
2016-03-09T00:00:00.000000000: 373
2016-03-10T00:00:00.000000000: 374
2016-03-11T00:00:00.000000000: 375
2016-03-12T00:00:00.000000000: 376
2016-03-13T00:00:00.000000000: 377
2016-03-14T00:00:00.000000000: 378
2016-03-15T00:00:00.000000000: 379
2016-03-16T00:00:00.000000000: 380
2016-03-17T00:00:00.000000000: 381
2016-03-18T00:00:00.000000000: 382
2016-03-19T00:00:00.000000000: 383
2016-03-20T00:00:00.000000000: 384
2016-03-21T00:00:00.000000000: 385
2016-03-22T00:00:00.000000000: 386
2016-03-23T00:00:00.000000000: 387
2016-03-24T00:00:00.000000000: 388
2016-03-25T00:00:00.000000000: 389
2016-03-26T00:00:00.000000000: 390
2016-03-27T00:00:00.000000000: 391
2016-03-28T00:00:00.000000000: 392
2016-03-29T00:00:00.000000000: 393
2016-03-30T00:00:00.000000000: 394
2016-03-31T00:00:00.000000000: 395
2016-04-01T00:00:00.000000000: 396
2016-04-02T00:00:00.000000000: 397
2016-04-03T00:00:00.000000000: 398
2016-04-04T00:00:00.000000000: 399
2016-04-05T00:00:00.000000000: 400
2016-04-06T00:00:00.000000000: 401
2016-04-07T00:00:00.000000000: 402
2016-04-08T00:00:00.000000000: 403
2016-04-09T00:00:00.000000000: 404
2016-04-10T00:00:00.000000000: 405
2016-04-11T00:00:00.000000000: 406
2016-04-12T00:00:00.000000000: 407
2016-04-13T00:00:00.000000000: 408
2016-04-14T00:00:00.000000000: 409
2016-04-15T00:00:00.000000000: 410
2016-04-16T00:00:00.000000000: 411
2016-04-17T00:00:00.000000000: 412
2016-04-18T00:00:00.000000000: 413
2016-04-19T00:00:00.000000000: 414
2016-04-20T00:00:00.000000000: 415
2016-04-21T00:00:00.000000000: 416
2016-04-22T00:00:00.000000000: 417
2016-04-23T00:00:00.000000000: 418
2016-04-24T00:00:00.000000000: 419
2016-04-25T00:00:00.000000000: 420
2016-04-26T00:00:00.000000000: 421
2016-04-27T00:00:00.000000000: 422
2016-04-28T00:00:00.000000000: 423
2016-04-29T00:00:00.000000000: 424
2016-04-30T00:00:00.000000000: 425
2016-05-01T00:00:00.000000000: 426
2016-05-02T00:00:00.000000000: 427
2016-05-03T00:00:00.000000000: 428
2016-05-04T00:00:00.000000000: 429
2016-05-05T00:00:00.000000000: 430
2016-05-06T00:00:00.000000000: 431
2016-05-07T00:00:00.000000000: 432
2016-05-08T00:00:00.000000000: 433
2016-05-09T00:00:00.000000000: 434
2016-05-10T00:00:00.000000000: 435
2016-05-11T00:00:00.000000000: 436
2016-05-12T00:00:00.000000000: 437
2016-05-13T00:00:00.000000000: 438
2016-05-14T00:00:00.000000000: 439
2016-05-15T00:00:00.000000000: 440
2016-05-16T00:00:00.000000000: 441
2016-05-17T00:00:00.000000000: 442
2016-05-18T00:00:00.000000000: 443
2016-05-19T00:00:00.000000000: 444
2016-05-20T00:00:00.000000000: 445
2016-05-21T00:00:00.000000000: 446
2016-05-22T00:00:00.000000000: 447
2016-05-23T00:00:00.000000000: 448
2016-05-24T00:00:00.000000000: 449
2016-05-25T00:00:00.000000000: 450
2016-05-26T00:00:00.000000000: 451
2016-05-27T00:00:00.000000000: 452
2016-05-28T00:00:00.000000000: 453
2016-05-29T00:00:00.000000000: 454
2016-05-30T00:00:00.000000000: 455
2016-05-31T00:00:00.000000000: 456
2016-06-01T00:00:00.000000000: 457
2016-06-02T00:00:00.000000000: 458
2016-06-03T00:00:00.000000000: 459
2016-06-04T00:00:00.000000000: 460
2016-06-05T00:00:00.000000000: 461
2016-06-06T00:00:00.000000000: 462
2016-06-07T00:00:00.000000000: 463
2016-06-08T00:00:00.000000000: 464
2016-06-09T00:00:00.000000000: 465
2016-06-10T00:00:00.000000000: 466
2016-06-11T00:00:00.000000000: 467
2016-06-12T00:00:00.000000000: 468
2016-06-13T00:00:00.000000000: 469
2016-06-14T00:00:00.000000000: 470
2016-06-15T00:00:00.000000000: 471
2016-06-16T00:00:00.000000000: 472
2016-06-17T00:00:00.000000000: 473
2016-06-18T00:00:00.000000000: 474
2016-06-19T00:00:00.000000000: 475
2016-06-20T00:00:00.000000000: 476
2016-06-21T00:00:00.000000000: 477
2016-06-22T00:00:00.000000000: 478
2016-06-23T00:00:00.000000000: 479
2016-06-24T00:00:00.000000000: 480
2016-06-25T00:00:00.000000000: 481
2016-06-26T00:00:00.000000000: 482
2016-06-27T00:00:00.000000000: 483
2016-06-28T00:00:00.000000000: 484
2016-06-29T00:00:00.000000000: 485
2016-06-30T00:00:00.000000000: 486
2016-07-01T00:00:00.000000000: 487
2016-07-02T00:00:00.000000000: 488
2016-07-03T00:00:00.000000000: 489
2016-07-04T00:00:00.000000000: 490
2016-07-05T00:00:00.000000000: 491
2016-07-06T00:00:00.000000000: 492
2016-07-07T00:00:00.000000000: 493
2016-07-08T00:00:00.000000000: 494
2016-07-09T00:00:00.000000000: 495
2016-07-10T00:00:00.000000000: 496
2016-07-11T00:00:00.000000000: 497
2016-07-12T00:00:00.000000000: 498
2016-07-13T00:00:00.000000000: 499
2016-07-14T00:00:00.000000000: 500
2016-07-15T00:00:00.000000000: 501
2016-07-16T00:00:00.000000000: 502
2016-07-17T00:00:00.000000000: 503
2016-07-18T00:00:00.000000000: 504
2016-07-19T00:00:00.000000000: 505
2016-07-20T00:00:00.000000000: 506
2016-07-21T00:00:00.000000000: 507
2016-07-22T00:00:00.000000000: 508
2016-07-23T00:00:00.000000000: 509
2016-07-24T00:00:00.000000000: 510
2016-07-25T00:00:00.000000000: 511
2016-07-26T00:00:00.000000000: 512
2016-07-27T00:00:00.000000000: 513
2016-07-28T00:00:00.000000000: 514
2016-07-29T00:00:00.000000000: 515
2016-07-30T00:00:00.000000000: 516
2016-07-31T00:00:00.000000000: 517
2016-08-01T00:00:00.000000000: 518
2016-08-02T00:00:00.000000000: 519
2016-08-03T00:00:00.000000000: 520
2016-08-04T00:00:00.000000000: 521
2016-08-05T00:00:00.000000000: 522
2016-08-06T00:00:00.000000000: 523
2016-08-07T00:00:00.000000000: 524
2016-08-08T00:00:00.000000000: 525
2016-08-09T00:00:00.000000000: 526
2016-08-10T00:00:00.000000000: 527
2016-08-11T00:00:00.000000000: 528
2016-08-12T00:00:00.000000000: 529
2016-08-13T00:00:00.000000000: 530
2016-08-14T00:00:00.000000000: 531
2016-08-15T00:00:00.000000000: 532
2016-08-16T00:00:00.000000000: 533
2016-08-17T00:00:00.000000000: 534
2016-08-18T00:00:00.000000000: 535
2016-08-19T00:00:00.000000000: 536
2016-08-20T00:00:00.000000000: 537
2016-08-21T00:00:00.000000000: 538
2016-08-22T00:00:00.000000000: 539
2016-08-23T00:00:00.000000000: 540
2016-08-24T00:00:00.000000000: 541
2016-08-25T00:00:00.000000000: 542
2016-08-26T00:00:00.000000000: 543
2016-08-27T00:00:00.000000000: 544
2016-08-28T00:00:00.000000000: 545
2016-08-29T00:00:00.000000000: 546
2016-08-30T00:00:00.000000000: 547
2016-08-31T00:00:00.000000000: 548
2016-09-01T00:00:00.000000000: 549
2016-09-02T00:00:00.000000000: 550
2016-09-03T00:00:00.000000000: 551
2016-09-04T00:00:00.000000000: 552
2016-09-05T00:00:00.000000000: 553
2016-09-06T00:00:00.000000000: 554
2016-09-07T00:00:00.000000000: 555
2016-09-08T00:00:00.000000000: 556
2016-09-09T00:00:00.000000000: 557
2016-09-10T00:00:00.000000000: 558
2016-09-11T00:00:00.000000000: 559
2016-09-12T00:00:00.000000000: 560
2016-09-13T00:00:00.000000000: 561
2016-09-14T00:00:00.000000000: 562
2016-09-15T00:00:00.000000000: 563
2016-09-16T00:00:00.000000000: 564
2016-09-17T00:00:00.000000000: 565
2016-09-18T00:00:00.000000000: 566
2016-09-19T00:00:00.000000000: 567
2016-09-20T00:00:00.000000000: 568
2016-09-21T00:00:00.000000000: 569
2016-09-22T00:00:00.000000000: 570
2016-09-23T00:00:00.000000000: 571
2016-09-24T00:00:00.000000000: 572
2016-09-25T00:00:00.000000000: 573
2016-09-26T00:00:00.000000000: 574
2016-09-27T00:00:00.000000000: 575
2016-09-28T00:00:00.000000000: 576
2016-09-29T00:00:00.000000000: 577
2016-09-30T00:00:00.000000000: 578
2016-10-01T00:00:00.000000000: 579
2016-10-02T00:00:00.000000000: 580
2016-10-03T00:00:00.000000000: 581
2016-10-04T00:00:00.000000000: 582
2016-10-05T00:00:00.000000000: 583
2016-10-06T00:00:00.000000000: 584
2016-10-07T00:00:00.000000000: 585
2016-10-08T00:00:00.000000000: 586
2016-10-09T00:00:00.000000000: 587
2016-10-10T00:00:00.000000000: 588
2016-10-11T00:00:00.000000000: 589
2016-10-12T00:00:00.000000000: 590
2016-10-13T00:00:00.000000000: 591
2016-10-14T00:00:00.000000000: 592
2016-10-15T00:00:00.000000000: 593
2016-10-16T00:00:00.000000000: 594
2016-10-17T00:00:00.000000000: 595
2016-10-18T00:00:00.000000000: 596
2016-10-19T00:00:00.000000000: 597
2016-10-20T00:00:00.000000000: 598
2016-10-21T00:00:00.000000000: 599
2016-10-22T00:00:00.000000000: 600
2016-10-23T00:00:00.000000000: 601
2016-10-24T00:00:00.000000000: 602
2016-10-25T00:00:00.000000000: 603
2016-10-26T00:00:00.000000000: 604
2016-10-27T00:00:00.000000000: 605
2016-10-28T00:00:00.000000000: 606
2016-10-29T00:00:00.000000000: 607
2016-10-30T00:00:00.000000000: 608
2016-10-31T00:00:00.000000000: 609
2016-11-01T00:00:00.000000000: 610
2016-11-02T00:00:00.000000000: 611
2016-11-03T00:00:00.000000000: 612
2016-11-04T00:00:00.000000000: 613
2016-11-05T00:00:00.000000000: 614
2016-11-06T00:00:00.000000000: 615
2016-11-07T00:00:00.000000000: 616
2016-11-08T00:00:00.000000000: 617
2016-11-09T00:00:00.000000000: 618
2016-11-10T00:00:00.000000000: 619
2016-11-11T00:00:00.000000000: 620
2016-11-12T00:00:00.000000000: 621
2016-11-13T00:00:00.000000000: 622
2016-11-14T00:00:00.000000000: 623
2016-11-15T00:00:00.000000000: 624
2016-11-16T00:00:00.000000000: 625
2016-11-17T00:00:00.000000000: 626
2016-11-18T00:00:00.000000000: 627
2016-11-19T00:00:00.000000000: 628
2016-11-20T00:00:00.000000000: 629
2016-11-21T00:00:00.000000000: 630
2016-11-22T00:00:00.000000000: 631
2016-11-23T00:00:00.000000000: 632
2016-11-24T00:00:00.000000000: 633
2016-11-25T00:00:00.000000000: 634
2016-11-26T00:00:00.000000000: 635
2016-11-27T00:00:00.000000000: 636
2016-11-28T00:00:00.000000000: 637
2016-11-29T00:00:00.000000000: 638
2016-11-30T00:00:00.000000000: 639
2016-12-01T00:00:00.000000000: 640
2016-12-02T00:00:00.000000000: 641
2016-12-03T00:00:00.000000000: 642
2016-12-04T00:00:00.000000000: 643
2016-12-05T00:00:00.000000000: 644
2016-12-06T00:00:00.000000000: 645
2016-12-07T00:00:00.000000000: 646
2016-12-08T00:00:00.000000000: 647
2016-12-09T00:00:00.000000000: 648
2016-12-10T00:00:00.000000000: 649
2016-12-11T00:00:00.000000000: 650
2016-12-12T00:00:00.000000000: 651
2016-12-13T00:00:00.000000000: 652
2016-12-14T00:00:00.000000000: 653
2016-12-15T00:00:00.000000000: 654
2016-12-16T00:00:00.000000000: 655
2016-12-17T00:00:00.000000000: 656
2016-12-18T00:00:00.000000000: 657
2016-12-19T00:00:00.000000000: 658
2016-12-20T00:00:00.000000000: 659
2016-12-21T00:00:00.000000000: 660
2016-12-22T00:00:00.000000000: 661
2016-12-23T00:00:00.000000000: 662
2016-12-24T00:00:00.000000000: 663
2016-12-25T00:00:00.000000000: 664
2016-12-26T00:00:00.000000000: 665
2016-12-27T00:00:00.000000000: 666
2016-12-28T00:00:00.000000000: 667
2016-12-29T00:00:00.000000000: 668
2016-12-30T00:00:00.000000000: 669
2016-12-31T00:00:00.000000000: 670
2017-01-01T00:00:00.000000000: 671
2017-01-02T00:00:00.000000000: 672
2017-01-03T00:00:00.000000000: 673
2017-01-04T00:00:00.000000000: 674
2017-01-05T00:00:00.000000000: 675
2017-01-06T00:00:00.000000000: 676
2017-01-07T00:00:00.000000000: 677
2017-01-08T00:00:00.000000000: 678
2017-01-09T00:00:00.000000000: 679
2017-01-10T00:00:00.000000000: 680
2017-01-11T00:00:00.000000000: 681
2017-01-12T00:00:00.000000000: 682
2017-01-13T00:00:00.000000000: 683
2017-01-14T00:00:00.000000000: 684
2017-01-15T00:00:00.000000000: 685
2017-01-16T00:00:00.000000000: 686
2017-01-17T00:00:00.000000000: 687
2017-01-18T00:00:00.000000000: 688
2017-01-19T00:00:00.000000000: 689
2017-01-20T00:00:00.000000000: 690
2017-01-21T00:00:00.000000000: 691
2017-01-22T00:00:00.000000000: 692
2017-01-23T00:00:00.000000000: 693
2017-01-24T00:00:00.000000000: 694
2017-01-25T00:00:00.000000000: 695
2017-01-26T00:00:00.000000000: 696
2017-01-27T00:00:00.000000000: 697
2017-01-28T00:00:00.000000000: 698
2017-01-29T00:00:00.000000000: 699
2017-01-30T00:00:00.000000000: 700
2017-01-31T00:00:00.000000000: 701
2017-02-01T00:00:00.000000000: 702
2017-02-02T00:00:00.000000000: 703
2017-02-03T00:00:00.000000000: 704
2017-02-04T00:00:00.000000000: 705
2017-02-05T00:00:00.000000000: 706
2017-02-06T00:00:00.000000000: 707
2017-02-07T00:00:00.000000000: 708
2017-02-08T00:00:00.000000000: 709
2017-02-09T00:00:00.000000000: 710
2017-02-10T00:00:00.000000000: 711
2017-02-11T00:00:00.000000000: 712
2017-02-12T00:00:00.000000000: 713
2017-02-13T00:00:00.000000000: 714
2017-02-14T00:00:00.000000000: 715
2017-02-15T00:00:00.000000000: 716
2017-02-16T00:00:00.000000000: 717
2017-02-17T00:00:00.000000000: 718
2017-02-18T00:00:00.000000000: 719
2017-02-19T00:00:00.000000000: 720
2017-02-20T00:00:00.000000000: 721
2017-02-21T00:00:00.000000000: 722
2017-02-22T00:00:00.000000000: 723
2017-02-23T00:00:00.000000000: 724
2017-02-24T00:00:00.000000000: 725
2017-02-25T00:00:00.000000000: 726
2017-02-26T00:00:00.000000000: 727
2017-02-27T00:00:00.000000000: 728
2017-02-28T00:00:00.000000000: 729
2017-03-01T00:00:00.000000000: 730
2017-03-02T00:00:00.000000000: 731
2017-03-03T00:00:00.000000000: 732
2017-03-04T00:00:00.000000000: 733
2017-03-05T00:00:00.000000000: 734
2017-03-06T00:00:00.000000000: 735
2017-03-07T00:00:00.000000000: 736
2017-03-08T00:00:00.000000000: 737
2017-03-09T00:00:00.000000000: 738
2017-03-10T00:00:00.000000000: 739
2017-03-11T00:00:00.000000000: 740
2017-03-12T00:00:00.000000000: 741
2017-03-13T00:00:00.000000000: 742
2017-03-14T00:00:00.000000000: 743
2017-03-15T00:00:00.000000000: 744
2017-03-16T00:00:00.000000000: 745
2017-03-17T00:00:00.000000000: 746
2017-03-18T00:00:00.000000000: 747
2017-03-19T00:00:00.000000000: 748
2017-03-20T00:00:00.000000000: 749
2017-03-21T00:00:00.000000000: 750
2017-03-22T00:00:00.000000000: 751
2017-03-23T00:00:00.000000000: 752
2017-03-24T00:00:00.000000000: 753
2017-03-25T00:00:00.000000000: 754
2017-03-26T00:00:00.000000000: 755
2017-03-27T00:00:00.000000000: 756
2017-03-28T00:00:00.000000000: 757
2017-03-29T00:00:00.000000000: 758
2017-03-30T00:00:00.000000000: 759
2017-03-31T00:00:00.000000000: 760
2017-04-01T00:00:00.000000000: 761
2017-04-02T00:00:00.000000000: 762
2017-04-03T00:00:00.000000000: 763
2017-04-04T00:00:00.000000000: 764
2017-04-05T00:00:00.000000000: 765
2017-04-06T00:00:00.000000000: 766
2017-04-07T00:00:00.000000000: 767
2017-04-08T00:00:00.000000000: 768
2017-04-09T00:00:00.000000000: 769
2017-04-10T00:00:00.000000000: 770
2017-04-11T00:00:00.000000000: 771
2017-04-12T00:00:00.000000000: 772
2017-04-13T00:00:00.000000000: 773
2017-04-14T00:00:00.000000000: 774
2017-04-15T00:00:00.000000000: 775
2017-04-16T00:00:00.000000000: 776
2017-04-17T00:00:00.000000000: 777
2017-04-18T00:00:00.000000000: 778
2017-04-19T00:00:00.000000000: 779
2017-04-20T00:00:00.000000000: 780
2017-04-21T00:00:00.000000000: 781
2017-04-22T00:00:00.000000000: 782
2017-04-23T00:00:00.000000000: 783
2017-04-24T00:00:00.000000000: 784
2017-04-25T00:00:00.000000000: 785
2017-04-26T00:00:00.000000000: 786
2017-04-27T00:00:00.000000000: 787
2017-04-28T00:00:00.000000000: 788
2017-04-29T00:00:00.000000000: 789
2017-04-30T00:00:00.000000000: 790
2017-05-01T00:00:00.000000000: 791
2017-05-02T00:00:00.000000000: 792
2017-05-03T00:00:00.000000000: 793
2017-05-04T00:00:00.000000000: 794
2017-05-05T00:00:00.000000000: 795
2017-05-06T00:00:00.000000000: 796
2017-05-07T00:00:00.000000000: 797
2017-05-08T00:00:00.000000000: 798
2017-05-09T00:00:00.000000000: 799
2017-05-10T00:00:00.000000000: 800
2017-05-11T00:00:00.000000000: 801
2017-05-12T00:00:00.000000000: 802
2017-05-13T00:00:00.000000000: 803
2017-05-14T00:00:00.000000000: 804
2017-05-15T00:00:00.000000000: 805
2017-05-16T00:00:00.000000000: 806
2017-05-17T00:00:00.000000000: 807
2017-05-18T00:00:00.000000000: 808
2017-05-19T00:00:00.000000000: 809
2017-05-20T00:00:00.000000000: 810
2017-05-21T00:00:00.000000000: 811
2017-05-22T00:00:00.000000000: 812
2017-05-23T00:00:00.000000000: 813
2017-05-24T00:00:00.000000000: 814
2017-05-25T00:00:00.000000000: 815
2017-05-26T00:00:00.000000000: 816
2017-05-27T00:00:00.000000000: 817
2017-05-28T00:00:00.000000000: 818
2017-05-29T00:00:00.000000000: 819
2017-05-30T00:00:00.000000000: 820
2017-05-31T00:00:00.000000000: 821
2017-06-01T00:00:00.000000000: 822
2017-06-02T00:00:00.000000000: 823
2017-06-03T00:00:00.000000000: 824
2017-06-04T00:00:00.000000000: 825
2017-06-05T00:00:00.000000000: 826
2017-06-06T00:00:00.000000000: 827
2017-06-07T00:00:00.000000000: 828
2017-06-08T00:00:00.000000000: 829
2017-06-09T00:00:00.000000000: 830
2017-06-10T00:00:00.000000000: 831
2017-06-11T00:00:00.000000000: 832
2017-06-12T00:00:00.000000000: 833
2017-06-13T00:00:00.000000000: 834
2017-06-14T00:00:00.000000000: 835
2017-06-15T00:00:00.000000000: 836
2017-06-16T00:00:00.000000000: 837
2017-06-17T00:00:00.000000000: 838
2017-06-18T00:00:00.000000000: 839
2017-06-19T00:00:00.000000000: 840
2017-06-20T00:00:00.000000000: 841
2017-06-21T00:00:00.000000000: 842
2017-06-22T00:00:00.000000000: 843
2017-06-23T00:00:00.000000000: 844
2017-06-24T00:00:00.000000000: 845
2017-06-25T00:00:00.000000000: 846
2017-06-26T00:00:00.000000000: 847
2017-06-27T00:00:00.000000000: 848
2017-06-28T00:00:00.000000000: 849
2017-06-29T00:00:00.000000000: 850
2017-06-30T00:00:00.000000000: 851
2017-07-01T00:00:00.000000000: 852
2017-07-02T00:00:00.000000000: 853
2017-07-03T00:00:00.000000000: 854
2017-07-04T00:00:00.000000000: 855
2017-07-05T00:00:00.000000000: 856
2017-07-06T00:00:00.000000000: 857
2017-07-07T00:00:00.000000000: 858
2017-07-08T00:00:00.000000000: 859
2017-07-09T00:00:00.000000000: 860
2017-07-10T00:00:00.000000000: 861
2017-07-11T00:00:00.000000000: 862
2017-07-12T00:00:00.000000000: 863
2017-07-13T00:00:00.000000000: 864
2017-07-14T00:00:00.000000000: 865
2017-07-15T00:00:00.000000000: 866
2017-07-16T00:00:00.000000000: 867
2017-07-17T00:00:00.000000000: 868
2017-07-18T00:00:00.000000000: 869
2017-07-19T00:00:00.000000000: 870
2017-07-20T00:00:00.000000000: 871
2017-07-21T00:00:00.000000000: 872
2017-07-22T00:00:00.000000000: 873
2017-07-23T00:00:00.000000000: 874
2017-07-24T00:00:00.000000000: 875
2017-07-25T00:00:00.000000000: 876
2017-07-26T00:00:00.000000000: 877
2017-07-27T00:00:00.000000000: 878
2017-07-28T00:00:00.000000000: 879
2017-07-29T00:00:00.000000000: 880
2017-07-30T00:00:00.000000000: 881
2017-07-31T00:00:00.000000000: 882
2017-08-01T00:00:00.000000000: 883
2017-08-02T00:00:00.000000000: 884
2017-08-03T00:00:00.000000000: 885
2017-08-04T00:00:00.000000000: 886
2017-08-05T00:00:00.000000000: 887
2017-08-06T00:00:00.000000000: 888
2017-08-07T00:00:00.000000000: 889
2017-08-08T00:00:00.000000000: 890
2017-08-09T00:00:00.000000000: 891
2017-08-10T00:00:00.000000000: 892
2017-08-11T00:00:00.000000000: 893
2017-08-12T00:00:00.000000000: 894
2017-08-13T00:00:00.000000000: 895
2017-08-14T00:00:00.000000000: 896
2017-08-15T00:00:00.000000000: 897
2017-08-16T00:00:00.000000000: 898
2017-08-17T00:00:00.000000000: 899
2017-08-18T00:00:00.000000000: 900
2017-08-19T00:00:00.000000000: 901
2017-08-20T00:00:00.000000000: 902
2017-08-21T00:00:00.000000000: 903
2017-08-22T00:00:00.000000000: 904
2017-08-23T00:00:00.000000000: 905
2017-08-24T00:00:00.000000000: 906
2017-08-25T00:00:00.000000000: 907
2017-08-26T00:00:00.000000000: 908
2017-08-27T00:00:00.000000000: 909
2017-08-28T00:00:00.000000000: 910
2017-08-29T00:00:00.000000000: 911
2017-08-30T00:00:00.000000000: 912
2017-08-31T00:00:00.000000000: 913
2017-09-01T00:00:00.000000000: 914
2017-09-02T00:00:00.000000000: 915
2017-09-03T00:00:00.000000000: 916
2017-09-04T00:00:00.000000000: 917
2017-09-05T00:00:00.000000000: 918
2017-09-06T00:00:00.000000000: 919
2017-09-07T00:00:00.000000000: 920
2017-09-08T00:00:00.000000000: 921
2017-09-09T00:00:00.000000000: 922
2017-09-10T00:00:00.000000000: 923
2017-09-12T00:00:00.000000000: 924
2017-09-14T00:00:00.000000000: 925


Out[ ]:
hotel is_canceled lead_time arrival_date_year arrival_date_month arrival_date_week_number arrival_date_day_of_month meal country market_segment ... deposit_type days_in_waiting_list customer_type adr required_car_parking_spaces total_of_special_requests reservation_status reservation_status_date total_nights total_guests
0 1 0 342 2015 5 27 1 0 135 3 ... 0 0 2 0.0 0 0 1 121 0 2
1 1 0 737 2015 5 27 1 0 135 3 ... 0 0 2 0.0 0 0 1 121 0 2
2 1 0 7 2015 5 27 1 0 59 3 ... 0 0 2 75.0 0 0 1 122 1 1
3 1 0 13 2015 5 27 1 0 59 2 ... 0 0 2 75.0 0 0 1 122 1 1
4 1 0 14 2015 5 27 1 0 59 6 ... 0 0 2 98.0 0 1 1 123 2 2

5 rows × 27 columns

3.3 Data selection¶

Calculate correlation coefficients with the target variable 'is_canceled' to decide which attribute should be selected

In [ ]:
correlation_matrix = df.corr()
correlation_with_target = correlation_matrix['is_canceled'].sort_values(ascending=False)

# Display the correlation of each feature with 'is_canceled'
correlation_with_target
Out[ ]:
is_canceled                       1.000000
deposit_type                      0.468634
lead_time                         0.293123
country                           0.267502
distribution_channel              0.167600
previous_cancellations            0.110133
market_segment                    0.059338
days_in_waiting_list              0.054186
adr                               0.047557
total_guests                      0.046522
total_nights                      0.017779
arrival_date_year                 0.016660
arrival_date_week_number          0.008148
arrival_date_month               -0.001491
arrival_date_day_of_month        -0.006130
meal                             -0.017678
previous_bookings_not_canceled   -0.057358
reserved_room_type               -0.061282
customer_type                    -0.068140
is_repeated_guest                -0.084793
hotel                            -0.136531
booking_changes                  -0.144381
reservation_status_date          -0.162135
assigned_room_type               -0.176028
required_car_parking_spaces      -0.195498
total_of_special_requests        -0.234658
reservation_status               -0.917196
Name: is_canceled, dtype: float64

Visualize the correlation of features with 'is_canceled'

This visualization step creates a bar plot of the correlation coefficients, making it easier to visually determine which features have strong or weak correlations with the cancellation status.

In [ ]:
plt.figure(figsize=(12, 8))
sns.barplot(x=correlation_with_target.values, y=correlation_with_target.index, palette='mako')
plt.title('Correlation of Features with Cancellation Status')
plt.xlabel('Correlation coefficient')
plt.ylabel('Features')
plt.show()
C:\Users\Acer\AppData\Local\Temp\ipykernel_14172\2153014193.py:2: FutureWarning: 

Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `y` variable to `hue` and set `legend=False` for the same effect.

  sns.barplot(x=correlation_with_target.values, y=correlation_with_target.index, palette='mako')
No description has been provided for this image

Select features with significant correlation

Here, we consider |correlation| > 0.01 as significant. The target variable is_canceled is also removed from this list.

In [ ]:
significant_features = correlation_with_target[abs(correlation_with_target) > 0.01].index.tolist()

# Remove the target variable 'is_canceled' from the list of features
significant_features.remove('is_canceled')
significant_features.remove('reservation_status')

# Display the significant features
significant_features
Out[ ]:
['deposit_type',
 'lead_time',
 'country',
 'distribution_channel',
 'previous_cancellations',
 'market_segment',
 'days_in_waiting_list',
 'adr',
 'total_guests',
 'total_nights',
 'arrival_date_year',
 'meal',
 'previous_bookings_not_canceled',
 'reserved_room_type',
 'customer_type',
 'is_repeated_guest',
 'hotel',
 'booking_changes',
 'reservation_status_date',
 'assigned_room_type',
 'required_car_parking_spaces',
 'total_of_special_requests']

Create a new dataframe with significant features to store it

In [ ]:
data_selected = df[significant_features]
target= df['is_canceled']
# Display the selected data
data_selected
Out[ ]:
deposit_type lead_time country distribution_channel previous_cancellations market_segment days_in_waiting_list adr total_guests total_nights ... previous_bookings_not_canceled reserved_room_type customer_type is_repeated_guest hotel booking_changes reservation_status_date assigned_room_type required_car_parking_spaces total_of_special_requests
0 0 342 135 1 0 3 0 0.00 2 0 ... 0 2 2 0 1 3 121 2 0 0
1 0 737 135 1 0 3 0 0.00 2 0 ... 0 2 2 0 1 4 121 2 0 0
2 0 7 59 1 0 3 0 75.00 1 1 ... 0 0 2 0 1 0 122 2 0 0
3 0 13 59 0 0 2 0 75.00 1 1 ... 0 0 2 0 1 0 122 0 0 0
4 0 14 59 3 0 6 0 98.00 2 2 ... 0 0 2 0 1 0 123 0 0 1
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
119385 0 23 15 3 0 5 0 96.14 2 7 ... 0 0 2 0 0 0 919 0 0 0
119386 0 102 56 3 0 6 0 225.43 3 7 ... 0 4 2 0 0 0 920 4 0 2
119387 0 34 43 3 0 6 0 157.71 2 7 ... 0 3 2 0 0 0 920 3 0 4
119388 0 109 59 3 0 6 0 104.40 2 7 ... 0 0 2 0 0 0 920 0 0 0
119389 0 205 43 3 0 6 0 151.20 2 9 ... 0 0 2 0 0 0 920 0 0 2

119390 rows × 22 columns

Visualise the correlation coefficient between the selected features to have a better understanding of their relationship

In [ ]:
plt.figure(figsize=(15, 10))
sns.heatmap(data_selected.corr(), vmin = -1, vmax = 1, annot = True, fmt = ".2f",  cmap="mako")
Out[ ]:
<Axes: >
No description has been provided for this image

3.4: Split the data¶

We are consider 30% of the dataset will be used for testing, and the remaining 70% will be used for training.

In [ ]:
from sklearn.model_selection import train_test_split
# Split the data
X_train, X_test, y_train, y_test = train_test_split(data_selected, target, test_size=0.3, random_state=42)

# Print the shapes of the resulting splits
print(f"Training feature set shape: {X_train.shape}")
print(f"Testing feature set shape: {X_test.shape}")
print(f"Training target set shape: {y_train.shape}")
print(f"Testing target set shape: {y_test.shape}")
Training feature set shape: (83573, 22)
Testing feature set shape: (35817, 22)
Training target set shape: (83573,)
Testing target set shape: (35817,)

Explanation:

Component Explanation
x_train The training set of features, which is 70% of the original data (based on the test_size=0.3 parameter). This subset is used to train the machine learning model. Thus, we have 83,573 rows as our training set.
x_test The testing set of features, which is 30% of the original data. This subset is used to evaluate the performance of the model on unseen data. Thus, we have 35,871 rows as our testing set.
y_train The training set of target values, corresponding to x_train. These are the actual cancellation statuses used to train the model.
y_test The testing set of target values, corresponding to x_test. These are the actual cancellation statuses used to evaluate the model's predictions.

4.0 Model Development¶

4.1 The Selected of Machine Learning Algorithm is:¶

  1. Random Forest Algorithm
  2. Gradient Boosting Algorithm

4.2 Model Training¶

4.2.1 Random Forest Model¶

The Random Forest Classifier is an ensemble learning method. It is particularly suited for classification tasks like ours, where we aim to predict whether a hotel booking will be cancelled. The model operates by constructing a multitude of decision trees during training and outputting the class that is the mode of the classes of the individual trees.

In [ ]:
# Modelling
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score, ConfusionMatrixDisplay, classification_report
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from scipy.stats import randint

# Tree Visualisation
from sklearn.tree import export_graphviz
from IPython.display import Image
import graphviz
from sklearn.tree import plot_tree
import matplotlib.pyplot as plt

Training Process of the Random Forest Model

In [ ]:
# Create a random forest classifier
rf = RandomForestClassifier(n_estimators=100, random_state=42)

# Train the model and print confirmation
print("Training the model...")
rf.fit(X_train, y_train)
print("Model training completed.")
Training the model...
Model training completed.
In [ ]:
# Make predictions and print a sample of predictions
print("Making predictions on the test set...")
rf_predictions = rf.predict(X_test)

# Create a DataFrame to compare true labels and predictions
results_df = pd.DataFrame({
    'True Label (is cancelled)': y_test,
    'Predicted': rf_predictions
})

# Display the results for the first 10 samples
print("\nSample Predictions vs True Labels:")
print(results_df.head(10))

# Print classification report
print("Classification Report:")
print(classification_report(y_test, rf_predictions))

# Print accuracy
accuracy = accuracy_score(y_test, rf_predictions)
print(f"Random Forest Accuracy: {accuracy}")
Making predictions on the test set...

Sample Predictions vs True Labels:
        True Label (is cancelled)  Predicted
30946                           0          0
40207                           1          1
103708                          0          0
85144                           0          0
109991                          0          0
110622                          0          1
47790                           1          1
44992                           0          0
30528                           0          0
16886                           0          0
Classification Report:
              precision    recall  f1-score   support

           0       0.92      0.96      0.94     22478
           1       0.93      0.85      0.89     13339

    accuracy                           0.92     35817
   macro avg       0.92      0.91      0.91     35817
weighted avg       0.92      0.92      0.92     35817

Random Forest Accuracy: 0.9210989195074964

Result:

The Random Forest Algorithm achieved an accuracy of approximately 92.11%. This indicates a high level of overall correctness in its predictions.

Hyperparameter Tuning with Random Forest

Some key of hyperparameter tuning with Random Forest Model are used:

Key Hyperparameters Explanation
n_estimators Number of trees in the forest. More trees can improve accuracy but increase computation time.
max_depth Maximum depth of each tree. Deeper trees can capture more complex patterns but may lead to overfitting.
min_samples_split Minimum number of samples required to split an internal node. Higher values prevent the model from learning overly specific patterns.
min_samples_leaf Minimum number of samples required to be at a leaf node. Higher values can smooth out the model from learning overly specific patterns.
max_features Number of features considered for splitting at each node. Limiting this can reduce overfitting.
In [ ]:
param_dist = {
    'n_estimators': randint(100, 200),
    'max_depth': randint(10, 50),
    'min_samples_split': randint(2, 10),
    'min_samples_leaf': randint(1, 5),
    'max_features': ['auto', 'sqrt', 'log2']
}
# Initialize the Random Forest Classifier
rf = RandomForestClassifier(random_state=42)

# Initialize RandomizedSearchCV
random_search = RandomizedSearchCV(
    rf,
    param_distributions=param_dist,
    n_iter=50,
    cv=5,
    verbose=2,
    n_jobs=-1,
    random_state=42
)

# Perform hyperparameter tuning
print("Performing hyperparameter tuning...")
random_search.fit(X_train, y_train)
print("Tuning completed.")

# Get the best parameters
best_params = random_search.best_params_
print("Best parameters found:")
print(best_params)
Performing hyperparameter tuning...
Fitting 5 folds for each of 50 candidates, totalling 250 fits
c:\Users\Acer\AppData\Local\Programs\Python\Python312\Lib\site-packages\sklearn\model_selection\_validation.py:540: FitFailedWarning: 
60 fits failed out of a total of 250.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
25 fits failed with the following error:
Traceback (most recent call last):
  File "c:\Users\Acer\AppData\Local\Programs\Python\Python312\Lib\site-packages\sklearn\model_selection\_validation.py", line 888, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "c:\Users\Acer\AppData\Local\Programs\Python\Python312\Lib\site-packages\sklearn\base.py", line 1466, in wrapper
    estimator._validate_params()
  File "c:\Users\Acer\AppData\Local\Programs\Python\Python312\Lib\site-packages\sklearn\base.py", line 666, in _validate_params
    validate_parameter_constraints(
  File "c:\Users\Acer\AppData\Local\Programs\Python\Python312\Lib\site-packages\sklearn\utils\_param_validation.py", line 95, in validate_parameter_constraints
    raise InvalidParameterError(
sklearn.utils._param_validation.InvalidParameterError: The 'max_features' parameter of RandomForestClassifier must be an int in the range [1, inf), a float in the range (0.0, 1.0], a str among {'sqrt', 'log2'} or None. Got 'auto' instead.

--------------------------------------------------------------------------------
35 fits failed with the following error:
Traceback (most recent call last):
  File "c:\Users\Acer\AppData\Local\Programs\Python\Python312\Lib\site-packages\sklearn\model_selection\_validation.py", line 888, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "c:\Users\Acer\AppData\Local\Programs\Python\Python312\Lib\site-packages\sklearn\base.py", line 1466, in wrapper
    estimator._validate_params()
  File "c:\Users\Acer\AppData\Local\Programs\Python\Python312\Lib\site-packages\sklearn\base.py", line 666, in _validate_params
    validate_parameter_constraints(
  File "c:\Users\Acer\AppData\Local\Programs\Python\Python312\Lib\site-packages\sklearn\utils\_param_validation.py", line 95, in validate_parameter_constraints
    raise InvalidParameterError(
sklearn.utils._param_validation.InvalidParameterError: The 'max_features' parameter of RandomForestClassifier must be an int in the range [1, inf), a float in the range (0.0, 1.0], a str among {'log2', 'sqrt'} or None. Got 'auto' instead.

  warnings.warn(some_fits_failed_message, FitFailedWarning)
c:\Users\Acer\AppData\Local\Programs\Python\Python312\Lib\site-packages\sklearn\model_selection\_search.py:1052: UserWarning: One or more of the test scores are non-finite: [       nan 0.9090735  0.89700023 0.90860685 0.90949231        nan
        nan 0.90861883        nan 0.89659341 0.90559152 0.86399913
 0.89191485 0.90245655 0.88620726 0.90982736 0.90920513 0.90240868
        nan 0.91323751 0.91514004        nan 0.90586673 0.89154392
 0.9136563  0.89618656        nan 0.9142905  0.90804446 0.90294712
 0.90656074 0.91399135 0.87765185        nan 0.90888206 0.88669784
 0.90988718 0.91348878 0.87227931 0.89915402 0.9056992         nan
 0.89987199        nan 0.8640829         nan 0.90249244 0.90171469
        nan 0.90411976]
  warnings.warn(
Tuning completed.
Best parameters found:
{'max_depth': 38, 'max_features': 'log2', 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 188}

Using best parameters founded train again the model

In [ ]:
# Train the Random Forest model with the best parameters
rf_best = RandomForestClassifier(
    n_estimators=best_params['n_estimators'],
    max_depth=best_params['max_depth'],
    min_samples_split=best_params['min_samples_split'],
    min_samples_leaf=best_params['min_samples_leaf'],
    max_features=best_params['max_features'],
    random_state=42
)

print("Training the optimized Random Forest model...")
rf_best.fit(X_train, y_train)
print("Model training completed.")
Training the optimized Random Forest model...
Model training completed.
In [ ]:
# Make predictions on the test set
print("Making predictions on the test set...")
rf_predictions_hy = rf_best.predict(X_test)

# Create a DataFrame to compare true labels and predictions
results_df_hy = pd.DataFrame({
    'True Label (is cancelled)': y_test,
    'Predicted': rf_predictions_hy
})

# Display the results for the first 10 samples
print("\nSample Predictions vs True Labels:")
print(results_df_hy.head(10))

# Print classification report
print("Classification Report:")
print(classification_report(y_test, rf_predictions_hy))

# Calculate accuracy
accuracy = accuracy_score(y_test, rf_predictions_hy)
print(f"Random Forest Accuracy: {accuracy}")
Making predictions on the test set...

Sample Predictions vs True Labels:
        True Label (is cancelled)  Predicted
30946                           0          0
40207                           1          1
103708                          0          0
85144                           0          0
109991                          0          0
110622                          0          1
47790                           1          1
44992                           0          0
30528                           0          0
16886                           0          0
Classification Report:
              precision    recall  f1-score   support

           0       0.92      0.96      0.94     22478
           1       0.93      0.85      0.89     13339

    accuracy                           0.92     35817
   macro avg       0.92      0.91      0.91     35817
weighted avg       0.92      0.92      0.92     35817

Random Forest Accuracy: 0.9213781165368401

Result:

Random Forest Algorithm by hyperparameter tuning achieved an accuracy of approximately 92.13%. It had improved by 0.02% from the model where without using the hyperparameter tunings. This also indicates a high level of overall correctness in its predictions.

Tree Visualisation for Random Forest Algorithm¶

In [ ]:
from matplotlib import pyplot as plt
from sklearn.tree import plot_tree

# Visualize the first three trees from the optimized Random Forest
for i in range(3):
    plt.figure(figsize=(20, 5))  # Set figure size
    tree = rf_best.estimators_[i]  # Access the i-th tree from the Random Forest
    plot_tree(tree, 
              filled=True, 
              feature_names=X_train.columns,  # Replace with actual feature names
              class_names=['Not Canceled', 'Canceled'], 
              rounded=True,
              max_depth=2,  # Limit the depth for clearer visualization
              fontsize=10)  # Adjust fontsize for better readability
    plt.title(f'Tree {i+1} from Optimized Random Forest')
    plt.show()
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image

Explanation:

  • Tree 1 Journey in Random Forest Model:

    • Begins at the root node with the feature ‘previous_cancellations’.
      • First Decision: Assesses if a booking has any history of being cancelled before (previous_cancellations <= 0.5).
      • Initial Split: Divides the dataset of 52,848 samples based on:
        • No prior cancellations (left-branch).
        • At least one prior cancellation (right-branch).
      • Gini Impurity: At this stage, the Gini impurity is 0.468, indicating moderate uncertainty.
        • This impurity reflects a mix of both classes: Not-cancelled and Cancelled.
        • The aim is to find a split that reduces this impurity.
  • Left-child Node (No Previous Cancellations):

    • Next Decision: Evaluates the ‘lead_time’, the days between booking and check-in date.
      • Lead Time Check: Determines if lead_time <= 26.5 days.
      • Node Handling: Manages 49,931 samples, dividing them into:
        • Closer to stay date (<= 26.5 days).
        • Well in advance (> 26.5 days).
      • Implications of Lead Time:
        • Short lead times often indicate a firmer booking commitment and lower cancellation risk.
      • Gini Impurity: Decreases to 0.45, showing improved homogeneity.
        • Stronger bias towards non-cancellations with:
          • 29,647 samples classified as Not Cancelled.
          • 20,284 samples classified as Cancelled.
  • Right-child Node (With Previous Cancellations):

    • Next Decision: Examines the is_repeated_guest feature.
      • Repeated Guest Check: Considers if the guest has stayed at the hotel before (is_repeated_guest <= 0.5).
      • Node Coverage: Encompasses 2,917 samples, predominantly leading towards cancellations.
      • Gini Impurity: Drops significantly to 0.159, indicating high purity.
        • Most samples are Cancelled (1,917) compared to Not Cancelled (1,000).
      • Insights:
        • Repeated guests, with a known and favorable relationship with the hotel, are less likely to cancel.
        • This substantial reduction in Gini impurity highlights a decisive split, indicating higher cancellation risk among new guests with previous cancellations.
  • Tree 1 Continuation:

    • The tree continues making decisions until it predicts a class.
    • However, this is not the final predicted class.
  • Random Forest Model:

    • Executes similar processes on each tree within the forest.
    • Each tree focuses on different features.
    • The final prediction is made by taking a vote, with the majority class as the final prediction.
    • Simplified Tree Visualizations:
      • Capped at a depth of 2, these visualizations show the fundamental decision-making pathways.
      • The complete Random Forest includes many more trees and deeper splits.
      • These examples clarify the basic logic guiding the model’s predictions.
  • Model Robustness and Accuracy:

    • By averaging the outcomes of multiple trees, the Random Forest achieves higher accuracy and robustness in predicting booking cancellations.

4.2.2 Gradient Boosting Model¶

Training Process of the Gradient Boosting Model

Gradient Boosting is a powerful machine learning algorithm that can be used for both regression and classification problems. As a part of the ensemble learning family, it combines several weak learners (decision trees) into strong learners, or in other words, combines the predictions from multiple weaker models to produce a single and stronger predictive model. The core idea behind Gradient Boosting is to build a model by adding new models that reduce the errors by the previous ones.

In [ ]:
import time
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, StandardScaler
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import GradientBoostingClassifier
In [ ]:
# Identify numerical features for scaling
numerical_features = data_selected.columns.tolist()

# Define the preprocessor
preprocessor = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), numerical_features),
    ]
)

# Create the pipeline
pipeline = Pipeline(
    [
        ("preprocessor", preprocessor),
        ("classifier", GradientBoostingClassifier(random_state=42)),
    ]
)

# Perform 5-fold cross-validation
cv_scores = cross_val_score(pipeline, X_train, y_train, cv=5)

# Fit the model on the training data
pipeline.fit(X_train, y_train)

# Predict on the test set
y_pred = pipeline.predict(X_test)
In [ ]:
# Make predictions and print a sample of predictions
print("Making predictions on the test set...")
#gb_predictions = pipeline.predict(X_test)

# Create a DataFrame to compare true labels and predictions
results_df = pd.DataFrame({
    'True Label (is cancelled)': y_test,
    'Predicted': y_pred    #gb_predictions
})

# Display the results for the first 10 samples
print("\nSample Predictions vs True Labels:")
print(results_df.head(10))

# Print accuracy
accuracy = accuracy_score(y_test, y_pred) # y_pred = gb_predictions
print(f"\nGradient Boosting Accuracy: {accuracy}")

# Generate classification report
report = classification_report(y_test, y_pred)

print(f"Mean Cross-Validation Accuracy: {cv_scores.mean():.4f}")
print("\nClassification Report:")
print(report)
Making predictions on the test set...

Sample Predictions vs True Labels:
        True Label (is cancelled)  Predicted
30946                           0          0
40207                           1          1
103708                          0          0
85144                           0          0
109991                          0          1
110622                          0          1
47790                           1          0
44992                           0          0
30528                           0          0
16886                           0          0

Gradient Boosting Accuracy: 0.86983834492001
Mean Cross-Validation Accuracy: 0.8673

Classification Report:
              precision    recall  f1-score   support

           0       0.86      0.94      0.90     22478
           1       0.89      0.74      0.81     13339

    accuracy                           0.87     35817
   macro avg       0.87      0.84      0.86     35817
weighted avg       0.87      0.87      0.87     35817

Result:

The Gradient Boosting Algorithm achieved an accuracy of approximately 86.98%. This indicates a high level of overall correctness in its predictions but slightly lower than the Random Forest Algorithm.

Hyperparameter Tuning with Gradient Boosting Model

The key of hyperparameter tuning of Gradient Boosting Model are:

Key Hyperparameters Explanation
n_estimators Specifies the number of boosting stages or trees in the ensemble. More trees lead to better performance but also increase the risk of overfitting. Default: n_estimators = 100
learning_rate Determines the contribution of each tree to the final model. Smaller learning rates require more trees but can improve model generalization. Default: learning_rate = 0.1
max_depth Sets the maximum depth of each individual tree. Limiting the depth helps prevent overfitting by controlling the models complexity. Default: max_depth = 3
In [ ]:
# Define the parameter grid for GridSearchCV
param_grid = {
    'classifier__n_estimators': [50, 100, 150],
    'classifier__learning_rate': [0.01, 0.1, 0.2],
    'classifier__max_depth': [3, 4, 5]
}

# Perform GridSearchCV
gbr = GridSearchCV(pipeline, param_grid, cv=5, n_jobs=-1, verbose=2)
gbr.fit(X_train, y_train)

# Get the best parameters and best score
best_params = gbr.best_params_
best_score = gbr.best_score_

# Predict on the test set using the best estimator
y_pred = gbr.best_estimator_.predict(X_test)
Fitting 5 folds for each of 27 candidates, totalling 135 fits
In [ ]:
# Make predictions and print a sample of predictions
print("Making predictions on the test set after hyperparameter tuning...")

# Create a DataFrame to compare true labels and predictions
results_df = pd.DataFrame({
    'True Label (is cancelled)': y_test,
    'Predicted': y_pred
})

# Display the results for the first 10 samples
print("\nSample Predictions vs True Labels:")
print(results_df.head(10))

# Print accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"\nGradient Boosting Accuracy: {accuracy}")

# Generate classification report
report = classification_report(y_test, y_pred)

print(f"\nBest Parameters: {best_params}")
print(f"Best Cross-Validation Score: {best_score:.4f}")
print("\nClassification Report:")
print(report)
Making predictions on the test set after hyperparameter tuning...

Sample Predictions vs True Labels:
        True Label (is cancelled)  Predicted
30946                           0          0
40207                           1          1
103708                          0          0
85144                           0          0
109991                          0          1
110622                          0          0
47790                           1          1
44992                           0          0
30528                           0          0
16886                           0          0

Gradient Boosting Accuracy: 0.9050730100231733

Best Parameters: {'classifier__learning_rate': 0.2, 'classifier__max_depth': 5, 'classifier__n_estimators': 150}
Best Cross-Validation Score: 0.9037

Classification Report:
              precision    recall  f1-score   support

           0       0.90      0.95      0.93     22478
           1       0.91      0.83      0.87     13339

    accuracy                           0.91     35817
   macro avg       0.91      0.89      0.90     35817
weighted avg       0.91      0.91      0.90     35817

Result:

Gradient Boosting Algorithm by hyperparameter tuning achieved an accuracy of approximately 90.51%. It had improved by 3.53% after performing hyperparameter tuning. This also indicates a high level of overall correctness in its predictions.

Tree Visualisation of Gradient Boosting Model

In [ ]:
# Extract the trained Gradient Boosting model from the pipeline
best_pipeline = gbr.best_estimator_
gbr_model = best_pipeline.named_steps['classifier']

# Define a function to visualize the decision trees
def visualize_decision_tree(model, tree_index, feature_names):
    """Visualize a decision tree from a gradient boosting model"""
    plt.figure(figsize=(20, 8))
    plot_tree(
        model.estimators_[tree_index, 0],
        feature_names=feature_names,
        filled=True,
        rounded=True,
        fontsize=13,
        max_depth=2
    )
    plt.show()

# Visualize a couple of decision trees
# Note: feature_names should include the transformed feature names
# Here, we take the original feature names for simplicity, but you may need to adjust this
feature_names = categorical_features + numerical_features

visualize_decision_tree(gbr_model, tree_index=0, feature_names=feature_names)  # Visualize the first tree
visualize_decision_tree(gbr_model, tree_index=1, feature_names=feature_names)  # Visualize the second tree
No description has been provided for this image
No description has been provided for this image

Explanation:

The visualizations represent the first and second decision trees within our gradient boosting model, highlighting the initial decisions made based on the numerical features. Each tree is visualized up to a maximum depth of 2, simplifying the interpretation of the model's decision-making process.

In the root node of the first tree, the split condition is ‘hotel’ <= 1.12, with a Friedman’s Mean Squared Error (MSE) of 0.233. The Friedman’s MSE measures the quality of the split by calculating the pseudo-residuals based on the current model's predictions and the actual target values to improve the squared error after the split. This node evaluates all 83,573 samples and has an initial predicted value of -0.0. The split on ‘hotel’ indicates its importance in the initial decision-making process.

By examining these visualizations, we observe how the gradient boosting model starts to capture patterns in the data. The first tree focuses on ‘hotel’, splitting at a threshold of 1.12, laying the foundation for subsequent trees to refine the model's predictions. These early splits help us understand the model's approach to distinguishing different outcomes based on the features.

5.0 Result and Discussion¶

5.1 Model Performance¶

5.1.1 Accuracy Scores and Other Metrices¶

Result Comparison Between Random Forest and Gradient Boosting

To compare the performance of Random Forest (RF) and Gradient Boosting (GB) models, we will look at various metrics such as accuracy, precision, recall, F1-score, and the overall classification report.

Metric Random Forest (RF) Gradient Boosting (GB)
Accuracy 0.9213781165368401 0.9050730100231733
Metric Algorithm Precision Recall F1-Score
Macro Average RF 0.92 0.91 0.91
GB 0.91 0.89 0.90
Weighted Average RF 0.92 0.92 0.92
GB 0.92 0.91 0.90

Based on the performance metrics provided, the Random Forest model outperforms the Gradient Boosting model across all evaluated metrics, including accuracy, precision, recall, and F1-score. The higher values of these metrics for Random Forest indicate that it provides more reliable and accurate predictions compared to Gradient Boosting in this scenario.

This can be attributed to the robust ensemble method of Random Forest, which effectively reduces variance and mitigates overfitting through random sampling. Additionally, its relative simplicity in parameter tuning makes it a more reliable choice in scenarios where the Gradient Boosting model might struggle due to overfitting or the complexity of parameter optimization.

Interpreting Model Reliability Through Performance Scores

Diving into the comparative analysis for both models, Class 0 represents the cancelled bookings and Class 1 represents not cancelled bookings.

Precision and Recall

  • Class 0:
    • Random Forest: Precision = 0.92, Recall = 0.96
    • Gradient Boosting: Precision = 0.90, Recall = 0.95
  • Class 1:
    • Random Forest: Precision = 0.93, Recall = 0.85
    • Gradient Boosting: Precision = 0.91, Recall = 0.83

F1-score

  • The F1-score measures the balance between precision and recall:
    • Class 0:
      • Random Forest: F1-score = 0.94
      • Gradient Boosting: F1-score = 0.93
    • Class 1:
      • Random Forest: F1-score = 0.89
      • Gradient Boosting: F1-score = 0.87

Macro and Weighted Average

  • Both macro and weighted averages for precision, recall, and F1-score favor the Random Forest model over the Gradient Boosting model.

Accuracy

  • Random Forest model achieves higher accuracy (0.92) compared to Gradient Boosting model (0.90).

Based on the values above, it is observed that the Random Forest Model is more reliable compared to the Gradient Boosting model, providing better overall accuracy and performance across both class 0 and class 1.

5.1.2 Confusion Matrix¶

In [ ]:
# Import necessary libraries
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
In [ ]:
# Create the confusion matrices
cm_gb = confusion_matrix(y_test, y_pred)
cm_rf = confusion_matrix(y_test, rf_predictions_hy)

# Create a figure with two subplots (1 row, 2 columns)
fig, axes = plt.subplots(1, 2, figsize=(14, 6))  # Adjust figsize as needed

# Plot the confusion matrix for Random Forest Classifier
disp_rf = ConfusionMatrixDisplay(confusion_matrix=cm_rf)
disp_rf.plot(ax=axes[0], cmap='Blues', values_format='.2f')
axes[0].grid(False)  # Remove grid lines
axes[0].set_title('Confusion Matrix for Random Forest Classifier')

# Plot the confusion matrix for Gradient Boosting Classifier
disp_gb = ConfusionMatrixDisplay(confusion_matrix=cm_gb)
disp_gb.plot(ax=axes[1], cmap='Blues', values_format='.2f')
axes[1].grid(False)  # Remove grid lines
axes[1].set_title('Confusion Matrix for Gradient Boosting Classifier')

# Adjust layout and display the plot
plt.tight_layout()
plt.show()
No description has been provided for this image

Analysis:

According to Figure 25 and Figure 25, the low number of false positives (820) and false negatives (1996) indicates that the Random Forest model is effective in distinguishing between cancelled and non-cancelled bookings. In comparison, the higher number of false positives (1245) and false negatives (3417) suggests that the Gradient Boosting model is more prone to misclassifications in this specific dataset.

The analysis indicates that the Random Forest model outperforms the Gradient Boosting model in this dataset, particularly in terms of fewer false positives and false negatives. This makes the Random Forest model more reliable for predicting booking cancellations. The higher false positive rate in the Gradient Boosting model could result in unnecessary cancellations and lost business opportunities, emphasizing the need for careful consideration when choosing predictive models for this task. Efforts should focus on further reducing the false positive rate to enhance predictive accuracy and operational efficiency.

5.2 Important Features for Prediction¶

Feature Importance of Random Forest Model¶

Predicting hotel booking cancellations is crucial for effective revenue management and operational planning in the hospitality industry. By understanding which features most significantly influence cancellation probabilities, hotel managers can better anticipate future cancellations and implement strategies to mitigate potential losses. In our analysis, we examine the importance of various features in predicting cancellations using two machine learning models: the Random Forest Classifier and the Gradient Boosting Classifier.

We first extract the feature importance scores from the ‘rf_best’ Random Forest model. The ‘feature_importances_’ attribute of the Random Forest model returns an array of importance scores for each feature used in the model.

In [ ]:
importances = rf_best.feature_importances_
features = X_train.columns
feature_importance_rf = pd.DataFrame({'Feature': features, 'Importance': importances})
feature_importance_rf = feature_importance_rf.sort_values(by='Importance', ascending=False)
print(feature_importance_rf)
                           Feature  Importance
18         reservation_status_date    0.167209
1                        lead_time    0.135193
0                     deposit_type    0.116260
2                          country    0.109524
7                              adr    0.091920
5                   market_segment    0.057141
21       total_of_special_requests    0.056481
9                     total_nights    0.043995
4           previous_cancellations    0.027239
10               arrival_date_year    0.025800
19              assigned_room_type    0.025205
14                   customer_type    0.025133
20     required_car_parking_spaces    0.021445
17                 booking_changes    0.020220
8                     total_guests    0.017392
13              reserved_room_type    0.015171
11                            meal    0.013552
16                           hotel    0.011705
3             distribution_channel    0.011591
12  previous_bookings_not_canceled    0.004038
15               is_repeated_guest    0.002122
6             days_in_waiting_list    0.001668

In figure below, the Random Forest Classifier identifies the most important features for predicting hotel booking cancellations. The top three features are ‘reservation_status_date’, ‘lead_time’, and ‘deposit_type’. These features have the highest importance scores, indicating they play a crucial role in predicting whether a booking will be canceled. The ‘reservation_status_date’ likely captures the proximity of the reservation date to the actual booking date, which can be critical as last-minute changes or cancellations are more common. ‘lead_time’, the time between booking and arrival, is also highly influential as longer lead times may increase the likelihood of cancellations due to changes in plans. ‘deposit_type’ reflects the payment terms associated with the booking, where more flexible payment terms may lead to higher cancellation rates.

In [ ]:
# Plotting the feature importance
plt.figure(figsize=(12, 6))
sns.barplot(x='Importance', y='Feature', data=feature_importance_rf.head(10), palette='Blues')
plt.title('Top 10 Features Influencing Hotel Booking Cancellations from Random Forest Classifier')
plt.xlabel('Importance')
plt.ylabel('Feature')
plt.show()
C:\Users\Acer\AppData\Local\Temp\ipykernel_14172\947144894.py:3: FutureWarning: 

Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `y` variable to `hue` and set `legend=False` for the same effect.

  sns.barplot(x='Importance', y='Feature', data=feature_importance_rf.head(10), palette='Blues')
No description has been provided for this image

Feature Importance of Gradient Boosting Model¶

In [ ]:
# Extract the best pipeline (including preprocessing and model) from GridSearchCV
best_pipeline = gbr.best_estimator_

# Extract the Gradient Boosting classifier from the pipeline
best_model = best_pipeline.named_steps['classifier']

# Get feature importances from the Gradient Boosting model
feature_importances = best_model.feature_importances_

# Since all features are numerical, use their original names
numerical_feature_names = data_selected.columns

# Create a DataFrame for better visualization
importances_df = pd.DataFrame({
    'Feature': numerical_feature_names,
    'Importance': feature_importances
})

# Sort the DataFrame by importance
importances_df = importances_df.sort_values(by='Importance', ascending=False)

# Display the sorted feature importances
print("\nFeature Importances:")
print(importances_df)
Feature Importances:
                           Feature  Importance
0                     deposit_type    0.340626
18         reservation_status_date    0.125419
1                        lead_time    0.090650
5                   market_segment    0.084801
2                          country    0.084536
21       total_of_special_requests    0.066109
10               arrival_date_year    0.038992
7                              adr    0.032979
20     required_car_parking_spaces    0.031049
4           previous_cancellations    0.029959
14                   customer_type    0.013514
17                 booking_changes    0.011312
12  previous_bookings_not_canceled    0.010607
13              reserved_room_type    0.008431
9                     total_nights    0.008232
19              assigned_room_type    0.006303
16                           hotel    0.005644
11                            meal    0.005149
8                     total_guests    0.002659
6             days_in_waiting_list    0.001462
3             distribution_channel    0.001169
15               is_repeated_guest    0.000399

In the figure, the Gradient Boosting Classifier highlights ‘deposit_type’, ‘reservation_status_date’ and ‘lead_time’ as the most important features for predicting cancellations. Here, ‘deposit_type’ has an even more pronounced impact compared to the Random Forest model, underscoring its significance in cancellation predictions. The substantial importance of ‘deposit_type’ suggests that financial policies tied to bookings are a critical determinant of whether a customer will cancel. ‘reservation_status_date’ and ‘lead_time’ again show their relevance, indicating consistency in these factors across different modeling techniques.

In [ ]:
#import matplotlib.pyplot as plt
#import seaborn as sns

# Plotting the feature importance
plt.figure(figsize=(12, 6))
sns.barplot(x='Importance', y='Feature', data=importances_df .head(10), palette='Blues')
plt.title('Top 10 Features Influencing Hotel Booking Cancellations from Gradient Boosting Classifier')
plt.xlabel('Importance')
plt.ylabel('Feature')
plt.show()
C:\Users\Acer\AppData\Local\Temp\ipykernel_14172\1534948352.py:6: FutureWarning: 

Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `y` variable to `hue` and set `legend=False` for the same effect.

  sns.barplot(x='Importance', y='Feature', data=importances_df .head(10), palette='Blues')
No description has been provided for this image

From the outputs from both predictive models, we can conclude that the 3 of the most important features are ‘lead_time’, ‘reservation_status_date’ and ‘deposit_type’, with different rankings. These features influence prediction outcomes significantly. For instance, bookings with shorter ‘reservation_status_date’ intervals are less likely to be canceled as they are closer to the stay date, reducing the chance for a change of plans. Also, higher ‘lead_time’ values indicate more time for potential changes in the customer's situation, increasing cancellation risks. Examples from the dataset include bookings with lead times of 342 days and 737 days, with longer lead times associated with higher cancellation probabilities. The ‘deposit_type’ affects financial commitment where non-refundable deposits may reduce cancellations, whereas refundable deposits provide flexibility, potentially leading to higher cancellations. For example in our dataset, bookings with "No Deposit" policies are more likely to be canceled due to the lack of financial commitment.

In short, the differences of output for both models is because the Random Forest model aggregates the importance of features by averaging over many decision trees. This model indicated reservation_status_date, lead_time, and deposit_type as highly important because these features consistently influence the model's decision-making process in determining cancellations across various splits of the data. Gradient Boosting, on the other hand, builds trees sequentially, where each new tree attempts to correct errors made by the previous ones. This model often emphasizes features that correct the most significant mistakes from previous iterations. The model might also have given higher importance to reservation_status_date and lead_time as they substantially improve the model's performance when included in the sequential corrections.

5.3 Discussion of Findings¶

5.3.1 Implications for Hotel Management

At the end of our project, we have successfully built a robust model for predicting booking cancellations, with the Random Forest (RF) model outperforming the Gradient Boosting (GB) model across all key performance metrics. Implementing this model can significantly enhance the decision-making processes in the hotels involved.

By predicting the likelihood of cancellations, hotels can offer more flexible booking options to customers less likely to cancel, enhancing customer satisfaction while minimizing revenue loss. For bookings predicted to have a high cancellation risk, hotels can offer non-refundable booking incentives or require deposits, thus securing some revenue even in case of cancellations. Insights from the model also enable hotels to understand market trends and customer behaviour better, guiding strategic decisions in marketing, expansion, and service enhancements. Additionally, the predictive analytics helps identify which services guests value most, allowing hotels to make informed investment decisions that improve guest satisfaction and drive revenue growth.

Strategies can be made by hotel using the model to reduce cancellation. One of the best strategies is proactively communicate with guests who are predicted to cancel by sending confirmation emails, reminders of their booking details, and information about the hotel's amenities and local attractions. Additionally, hotel can offer personalized incentives to these high-risk guests, such as room upgrades, dining discounts, or free access to premium services like spa treatments, to encourage them to maintain their reservations.

Another key strategy for reducing cancellations is to collect feedback from guests who cancel to understand their reasons and use this data to improve services and address common issues. 24/7 customer support should be provided to assist guests with any concerns or changes they might need, reducing those cancellations due to dissatisfaction or uncertainty. Lastly, continuously monitoring the booking and cancellation patterns using the model allows hotels to adjust strategies accordingly, while regularly updating the model with new data ensures predictions remain accurate.

5.3.2 Limitations and Challenges Faced

In summary, we selected Random Forest and Gradient Boosting algorithms for our dataset. While both algorithms have many strengths, there are several limitations to consider. For instance, overfitting is a particular concern with Gradient Boosting models. If not properly regularized, these models can show degraded performance on unseen data, as indicated by the increased number of false positives and false negatives.

Tree-based models offer feature importance interpretation scores, but these can be biased towards numerical or high cardinality features. This bias arises because such models may not accurately capture complex feature interactions. Moreover, the performance of these models heavily depends on the choice of hyperparameters. The sensitivity to hyperparameters makes the tuning process time-consuming and does not always guarantee the optimal configuration.

To address these limitations, we employed a robust data preprocessing pipeline to handle and transform the data into a suitable format for modelling. This pipeline ensured consistency and reliability by managing missing values, encoding categorical variables, and scaling numerical features. Visualizing individual decision trees from the models provided insights into how different features influenced predictions, which improved our understanding of model behaviour and helped identify key features.

Additionally, we used the ‘GridSearchCV’ process to tackle two significant challenges: excessive computational resources and hyperparameter tuning. Given the iterative nature of Gradient Boosting, training and tuning these models require substantial computational power and time. We mitigated this by using efficient data structures and parallel processing when possible. Also, it was utilised to efficiently search for the optimal hyperparameters in a structured and systematic manner.

6.0 Conclusion¶

In this report, we conducted a comprehensive analysis of the hotel booking dataset, exploring various aspects of booking behaviour and trends. Our analysis included six different visualizations: the number of cancellations, the booking ratio between the two hotel types, the percentage of bookings for each year, the month with the most bookings, the leading country as the source of guests, the analysis of the length of stay by hotel type, and the popular accommodation types (single, couple, family). These visualizations provided valuable insights into booking patterns and customer preferences.

Additionally, we trained the dataset using two machine learning algorithms: Random Forest and Gradient Boosting. The Random Forest model achieved an accuracy of 0.92, while the Gradient Boosting model had an accuracy of 0.90. These models were developed following a thorough data preprocessing workflow, which included feature engineering, data transformation, and data selection. This ensured that the models were trained on high-quality data, enhancing their predictive performance.

Our findings highlight key trends and factors influencing hotel bookings, offering actionable insights for improving booking strategies such as emphasizing on communication in customer service aspects, and personalized incentives for customer satisfaction. The high accuracy of our models further demonstrates the potential of machine learning in predicting booking outcomes, which can be leveraged to optimize hotel operations and marketing efforts.